Abstract

We analyze a class of estimators based on convex relaxation for solving high-dimensional
matrix decomposition problems. The observations are noisy realizations of a linear
transformation $\mathfrak{X}$ of the sum of an (approximately) low rank matrix $\Theta^*$ with a second
matrix $\Gamma^*$ endowed with a complementary form of low-dimensional structure; this set-up
includes many statistical models of interest, including forms of factor analysis, multi-task
regression with shared structure, and robust covariance estimation. We derive a general theorem
that gives upper bounds on the Frobenius norm error for an estimate of the pair $(\Theta^*, \Gamma^*)$
obtained by solving a convex optimization problem that combines the nuclear norm with
a general decomposable regularizer. Our results are based on imposing a "spikiness"
condition that is related to but milder than singular vector incoherence. We specialize our
general result to two cases that have been studied in past work: low rank plus an
entrywise sparse matrix, and low rank plus a columnwise sparse matrix. For both models, our
theory yields non-asymptotic Frobenius error bounds for both deterministic and stochastic
noise matrices, and applies to matrices $\Theta^*$ that can be exactly or approximately low
rank, and matrices $\Gamma^*$ that can be exactly or approximately sparse. Moreover, for the case
of stochastic noise matrices and the identity observation operator, we establish matching
lower bounds on the minimax error, showing that our results cannot be improved beyond
constant factors. The sharpness of our theoretical predictions is confirmed by numerical
simulations.



1 Introduction 



The focus of this paper is a class of high-dimensional matrix decomposition problems of the
following variety. Suppose that we observe a matrix $Y \in \mathbb{R}^{d_1 \times d_2}$ that is (approximately)
equal to the sum of two unknown matrices: how can we recover good estimates of the pair? Of
course, this problem is ill-posed in general, so it is necessary to impose some kind of
low-dimensional structure on the matrix components, one example being rank constraints.
The framework of this paper supposes that one matrix component (denoted $\Theta^*$) is low-rank,
either exactly or in an approximate sense, and allows for general forms of low-dimensional
structure for the second component $\Gamma^*$. Two particular cases of structure for $\Gamma^*$ that have been
considered in past work are elementwise sparsity [9, 8, 7] and columnwise sparsity [18, 29].

Problems of matrix decomposition are motivated by a variety of applications. Many
classical methods for dimensionality reduction, among them factor analysis and principal
components analysis (PCA), are based on estimating a low-rank matrix from data. Different
forms of robust PCA can be formulated in terms of matrix decomposition using the matrix
$\Gamma^*$ to model the gross errors [9, 7, 29]. Similarly, certain problems of robust covariance
estimation can be described using matrix decompositions with a column/row-sparse structure,
as we describe in this paper. The problem of low rank plus sparse matrix decomposition also
arises in Gaussian covariance selection with hidden variables [8], in which case the inverse
covariance of the observed vector can be decomposed as the sum of a sparse matrix with a
low rank matrix. Matrix decompositions also arise in multi-task regression [32, 21, 27], which
involves solving a collection of regression problems, referred to as tasks, over a common set
of features. For some features, one expects their weighting to be preserved across tasks,
which can be modeled by a low-rank constraint, whereas other features are expected to vary
across tasks, which can be modeled by a sparse component [5, 2]. See Section 2.1 for further
discussion of these motivating applications.

In this paper, we study a noisy linear observation model that can be used to describe a number
of applications in a unified way. Let $\mathfrak{X}$ be a linear operator that maps matrices in
$\mathbb{R}^{d_1 \times d_2}$ to matrices in $\mathbb{R}^{n_1 \times n_2}$. In the simplest of cases, this observation operator
is simply the identity mapping, so that we necessarily have $n_1 = d_1$ and $n_2 = d_2$. However, as
we discuss in the sequel, it is useful for certain applications, such as multi-task regression,
to consider more general linear operators of this form. Hence, we study the problem of matrix
decomposition for the general linear observation model

$$Y = \mathfrak{X}(\Theta^* + \Gamma^*) + W, \tag{1}$$

where $\Theta^*$ and $\Gamma^*$ are unknown $d_1 \times d_2$ matrices, and $W \in \mathbb{R}^{n_1 \times n_2}$ is some type of observation
noise; it is potentially dense, and can either be deterministic or stochastic. The matrix $\Theta^*$ is
assumed to be either exactly low-rank, or well-approximated by a low-rank matrix, whereas
the matrix $\Gamma^*$ is assumed to have a complementary type of low-dimensional structure, such
as sparsity. As we discuss in Section 2.1, a variety of interesting statistical models can be
formulated as instances of the observation model (1). Such models include versions of factor
analysis involving non-identity noise matrices, robust forms of covariance estimation, and
multi-task regression with some features shared across tasks and a sparse subset differing
across tasks. Given this observation model, our goal is to recover accurate estimates of the
decomposition $(\Theta^*, \Gamma^*)$ based on the noisy observations $Y$. In this paper, we analyze simple
estimators based on convex relaxations involving the nuclear norm, and a second general norm
$\mathcal{R}$.

Most past work on the model (1) has focused on the noiseless setting ($W = 0$) and on
the identity observation operator (so that $\mathfrak{X}(\Theta^* + \Gamma^*) = \Theta^* + \Gamma^*$). Chandrasekaran et al. [9]
studied the case when $\Gamma^*$ is assumed to be sparse, with a relatively small number $s \ll d_1 d_2$ of
non-zero entries. In the noiseless setting, they gave sufficient conditions for exact recovery
for an adversarial sparsity model, meaning the non-zero positions of $\Gamma^*$ can be arbitrary.
Subsequent work by Candes et al. [7] analyzed the same model but under an assumption of
random sparsity, meaning that the non-zero positions are chosen uniformly at random. In
very recent work, Xu et al. [29] have analyzed a different model, in which the matrix $\Gamma^*$ is
assumed to be columnwise sparse, with a relatively small number $s \ll d_2$ of non-zero columns.
Their analysis guaranteed approximate recovery for the low-rank matrix, in particular for the
uncorrupted columns. After initial posting of this work, we became aware of recent work by
Hsu et al. [14], who derived Frobenius norm error bounds for the case of exact elementwise
sparsity. As we discuss in more detail in Section 3.3, in this special case, our bounds are based
on milder conditions, and yield sharper rates for problems where the rank and sparsity scale
with the dimension.

Our main contribution is to provide a general oracle-type result (Theorem 1) on approximate
recovery of the unknown decomposition from noisy observations, valid for structural
constraints on $\Gamma^*$ imposed via a decomposable regularizer. The class of decomposable
regularizers, introduced in past work by Negahban et al. [19], includes the elementwise $\ell_1$-norm and
columnwise (2,1)-norm as special cases, as well as various other regularizers used in practice.
Theorem 1 provides finite-sample guarantees for estimates
obtained by solving a class of convex programs formed using a composite regularizer. The
resulting Frobenius norm error bounds consist of multiple terms, each of which has a natural
interpretation in terms of the estimation and approximation errors associated with the
sub-problems of recovering $\Theta^*$ and $\Gamma^*$. We then specialize Theorem 1 to the case of elementwise or
columnwise sparsity models for $\Gamma^*$, thereby obtaining recovery guarantees for matrices $\Theta^*$ that
may be either exactly or approximately low-rank, as well as matrices $\Gamma^*$ that may be either
exactly or approximately sparse. We provide non-asymptotic error bounds for general noise
matrices $W$ both for elementwise and columnwise sparse models (see Corollaries 1 through
6). To the best of our knowledge, these are the first results that apply to this broad
class of models, allowing for noisiness ($W \neq 0$) that is either stochastic or deterministic,
matrix components that are only approximately low-rank and/or sparse, and general forms of
the observation operator $\mathfrak{X}$.

In addition, the error bounds obtained by our analysis are sharp, and cannot be improved
in general. More precisely, for the case of stochastic noise matrices and the identity observation
operator, we prove that the squared Frobenius errors achieved by our estimators are
minimax-optimal (see Theorem 2). An interesting feature of our analysis is that, in contrast to previous
work [9, 29, 7], we do not impose incoherence conditions on the singular vectors of $\Theta^*$; rather,
we control the interaction with a milder condition involving the dual norm of the regularizer.
In the special case of elementwise sparsity, this dual norm enforces an upper bound on the
"spikiness" of the low-rank component, and has proven useful in the related setting of noisy
matrix completion [20]. This constraint is not strong enough to guarantee identifiability of
the models (and hence exact recovery in the noiseless setting), but it does provide a bound on
the degree of non-identifiability. We show that this same term arises in both the upper and
lower bounds on the problem of approximate recovery that is of interest in the noisy setting.

The remainder of the paper is organized as follows. In Section 2, we set up the problem
in a precise way, and describe the estimators. Section 3 is devoted to the statement of our
main result on achievability, as well as its various corollaries for special cases of the matrix
decomposition problem. We also state a matching lower bound on the minimax error for
matrix decomposition with stochastic noise. In Section 4, we provide numerical simulations
that illustrate the sharpness of our theoretical predictions. Section 5 is devoted to the proofs
of our results, with certain more technical aspects of the argument deferred to the appendices,
and we conclude with a discussion in Section 6.



Notation: For the reader's convenience, we summarize here some of the standard notation
used throughout this paper. For any matrix $M \in \mathbb{R}^{d_1 \times d_2}$, we define the Frobenius norm

$$|||M|||_F := \sqrt{\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} M_{jk}^2},$$

corresponding to the ordinary Euclidean norm of its entries. We denote its singular values by
$\sigma_1(M) \geq \sigma_2(M) \geq \cdots \geq \sigma_d(M) \geq 0$, where $d = \min\{d_1, d_2\}$.
Its nuclear norm is given by $|||M|||_N = \sum_{j=1}^{d} \sigma_j(M)$.
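
As a small numerical illustration of this notation, the following sketch computes both norms with NumPy. The function names and the example matrix are illustrative choices, not part of the paper.

```python
import numpy as np

def frobenius_norm(M):
    """Square root of the sum of squared entries of M."""
    return float(np.sqrt(np.sum(M ** 2)))

def singular_values(M):
    """Singular values in non-increasing order; there are min(d1, d2) of them."""
    return np.linalg.svd(M, compute_uv=False)

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return float(np.sum(singular_values(M)))

# Illustrative 2 x 3 matrix whose singular values are 4 and 3.
M = np.array([[3.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]])
```

For this matrix the Frobenius norm is the Euclidean length of the entries (3, 4), and the nuclear norm is the sum of the two singular values.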
2 Convex relaxations and matrix decomposition



In this paper, we consider a family of regularizers formed by a combination of the nuclear norm
$|||\Theta|||_N := \sum_{j=1}^{\min\{d_1, d_2\}} \sigma_j(\Theta)$, which acts as a convex surrogate to a rank constraint for $\Theta^*$ (e.g.,
see Recht et al. [25] and references therein), with a norm-based regularizer $\mathcal{R}: \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}_+$
used to constrain the structure of $\Gamma^*$. We provide a general theorem applicable to a class of
regularizers $\mathcal{R}$ that satisfy a certain decomposability property [19], and then consider in detail
a few particular choices of $\mathcal{R}$ that have been studied in past work, including the elementwise
$\ell_1$-norm, and the columnwise (2,1)-norm (see Examples 4 and 5 below).

2.1 Some motivating applications 

We begin with some motivating applications for the general linear observation model with
noise (1).

Example 1 (Factor analysis with sparse noise). In a factor analysis model, random vectors
$Z_i \in \mathbb{R}^d$ are assumed to be generated in an i.i.d. manner from the model

$$Z_i = L U_i + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{2}$$

where $L \in \mathbb{R}^{d \times r}$ is a loading matrix, and the vectors $U_i \sim N(0, I_{r \times r})$ and $\varepsilon_i \sim N(0, \Gamma^*)$ are
independent. Given $n$ i.i.d. samples from the model (2), the goal is to estimate the loading
matrix $L$, or the matrix $LL^T$ that projects onto the column span of $L$. A simple calculation shows
that the covariance matrix of $Z_i$ has the form $\Sigma = LL^T + \Gamma^*$. Consequently, in the special
case when $\Gamma^* = \sigma^2 I_{d \times d}$, the range of $L$ is spanned by the top $r$ eigenvectors of $\Sigma$, and
so we can recover it via standard principal components analysis.

In other applications, we might no longer be guaranteed that $\Gamma^*$ is a multiple of the identity, in which
case the top $r$ eigenvectors of $\Sigma$ need not be close to the column span of $L$. Nonetheless, when
$\Gamma^*$ is a sparse matrix, the problem of estimating $LL^T$ can be understood as an instance of
our general observation model (1) with $d_1 = d_2 = d$, and the identity observation operator
$\mathfrak{X}$ (so that $n_1 = n_2 = d$). In particular, if we let the observation matrix $Y \in \mathbb{R}^{d \times d}$ be the
sample covariance matrix $\frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^T$, then some algebra shows that $Y = \Theta^* + \Gamma^* + W$,
where $\Theta^* = LL^T$ is of rank $r$, and the random matrix $W$ is a re-centered form of Wishart
noise [1]; in particular, it is the zero-mean matrix

$$W := \frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^T - \{LL^T + \Gamma^*\}. \tag{3}$$

When $\Gamma^*$ is assumed to be elementwise sparse (i.e., with relatively few non-zero entries), then
this constraint can be enforced via the elementwise $\ell_1$-norm (see Example 4 to follow).
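
The two claims of Example 1 can be checked numerically. The sketch below is an illustration under assumed values (the dimensions, the seed, and the isotropic choice of $\Gamma^*$ are all hypothetical): it verifies that the top $r$ eigenvectors of $\Sigma$ span the range of $L$ in the isotropic case, and forms the re-centered Wishart noise $W$ of equation (3) from simulated samples.

```python
import numpy as np

def top_eigvecs(S, r):
    """Eigenvectors of the symmetric matrix S for its r largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    return vecs[:, -r:]

rng = np.random.default_rng(0)
d, r, n, sigma2 = 8, 2, 5000, 0.5      # hypothetical problem sizes
L = rng.standard_normal((d, r))        # hypothetical loading matrix
Gamma = sigma2 * np.eye(d)             # isotropic special case Gamma* = sigma^2 I
Sigma = L @ L.T + Gamma                # population covariance of Z_i

# Population level: projecting L onto the top-r eigenspace leaves no residual.
P = top_eigvecs(Sigma, r)
residual = L - P @ (P.T @ L)

# Sample level: rows of Z are draws Z_i = L U_i + eps_i from model (2).
Z = rng.standard_normal((n, r)) @ L.T + np.sqrt(sigma2) * rng.standard_normal((n, d))
Y = Z.T @ Z / n                        # sample covariance matrix
W = Y - Sigma                          # re-centered Wishart noise, equation (3)
```

With a non-isotropic sparse $\Gamma^*$ the first check would generally fail, which is the situation the decomposition approach is meant to handle.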

Example 2 (Multi-task regression). Suppose that we are given a collection of $d_2$ regression
problems in $\mathbb{R}^{d_1}$, each of the form $y_j = X\beta^*_j + w_j$ for $j = 1, 2, \ldots, d_2$. Here each
$\beta^*_j \in \mathbb{R}^{d_1}$ is an unknown regression vector, $w_j \in \mathbb{R}^n$ is observation noise, and $X \in \mathbb{R}^{n \times d_1}$
is the design matrix. This family of models can be written in a convenient matrix form as
$Y = XB^* + W$, where $Y = [y_1 \cdots y_{d_2}]$ and $W = [w_1 \cdots w_{d_2}]$ are both matrices in $\mathbb{R}^{n \times d_2}$
and $B^* := [\beta^*_1 \cdots \beta^*_{d_2}] \in \mathbb{R}^{d_1 \times d_2}$ is a matrix of regression vectors. Following standard
terminology in multi-task learning, we refer to each column of $B^*$ as a task, and each row of $B^*$ as
a feature.
In many applications, it is natural to assume that the feature weightings (that is, the
vectors $\beta^*_j$) exhibit some degree of shared structure across tasks [2, 32, 21, 27]. This
type of shared structure can be modeled by imposing a low-rank structure; for instance, in
the extreme case of rank one, it would enforce that each $\beta^*_j$ is a multiple of some common
underlying vector. However, many multi-task learning problems exhibit more complicated
structure, in which some subset of features are shared across tasks, and some other subset of
features vary substantially across tasks [2, 3]. For instance, in the Amazon recommendation
system, tasks correspond to different classes of products, such as books, electronics and so on,
and features include ratings by users. Some ratings (such as numerical scores) should have a
meaning that is preserved across tasks, whereas other features (e.g., the label "boring") are
very meaningful in applications to some categories (e.g., books) but less so in others (e.g.,
electronics).

This kind of structure can be captured by assuming that the unknown regression matrix
$B^*$ has a low-rank plus sparse decomposition, namely $B^* = \Theta^* + \Gamma^*$, where $\Theta^*$ is low-rank and
$\Gamma^*$ is sparse, with a relatively small number of non-zero entries, corresponding to feature/task
pairs that differ significantly from the baseline. A variant of this model is based on
instead assuming that $\Gamma^*$ is row-sparse, with a small number of non-zero rows. (In Example 5
to follow, we discuss an appropriate regularizer for enforcing such row or column sparsity.)
With this model structure, we then define the observation operator $\mathfrak{X}: \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^{n \times d_2}$ via
$A \mapsto XA$, so that $n_1 = n$ and $n_2 = d_2$ in our general notation. In this way, we obtain another
instance of the linear observation model (1).
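
A minimal sketch of this observation operator is given below. The adjoint map $B \mapsto X^T B$ is not spelled out in the text above; it is the standard adjoint of left-multiplication by $X$ with respect to the entrywise (trace) inner product, and is included here as an assumption because the adjoint of the operator appears in the later theory. All names and dimensions are illustrative.

```python
import numpy as np

def make_operator(X):
    """Return the map A -> X A and its adjoint B -> X^T B for a design matrix X."""
    def forward(A):
        return X @ A
    def adjoint(B):
        return X.T @ B
    return forward, adjoint

def trace_inner(A, B):
    """Entrywise inner product trace(A^T B) of two matrices of the same shape."""
    return float(np.sum(A * B))

rng = np.random.default_rng(1)
n, d1, d2 = 20, 6, 4                   # hypothetical sizes
X = rng.standard_normal((n, d1))       # hypothetical design matrix
forward, adjoint = make_operator(X)
A = rng.standard_normal((d1, d2))
B = rng.standard_normal((n, d2))
```

The defining property of the adjoint is that `trace_inner(forward(A), B)` equals `trace_inner(A, adjoint(B))` for all `A` and `B`.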



Example 3 (Robust covariance estimation). For $i = 1, 2, \ldots, n$, let $U_i \in \mathbb{R}^d$ be samples
from a zero-mean distribution with unknown covariance matrix $\Theta^*$. When the vectors $U_i$
are observed without any form of corruption, then it is straightforward to estimate $\Theta^*$ by
performing PCA on the sample covariance matrix. Imagining that $j \in \{1, 2, \ldots, d\}$ indexes
different individuals in the population, now suppose that the data associated with some subset
$S$ of individuals is arbitrarily corrupted. This adversarial corruption can be modeled by
assuming that we observe the vectors $Z_i = U_i + V_i$ for $i = 1, \ldots, n$, where each $V_i \in \mathbb{R}^d$ is a
vector supported on the subset $S$. Letting $Y = \frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^T$ be the sample covariance matrix
of the corrupted samples, some algebra shows that it can be decomposed as $Y = \Theta^* + \Delta + W$,
where $W := \frac{1}{n} \sum_{i=1}^{n} U_i U_i^T - \Theta^*$ is again a type of re-centered Wishart noise, and the remaining
term can be written as

$$\Delta := \frac{1}{n} \sum_{i=1}^{n} V_i V_i^T + \frac{1}{n} \sum_{i=1}^{n} \big( U_i V_i^T + V_i U_i^T \big). \tag{4}$$

Note that $\Delta$ itself is not a column-sparse or row-sparse matrix; however, since each vector
$V_i \in \mathbb{R}^d$ is supported only on some subset $S \subset \{1, 2, \ldots, d\}$, we can write $\Delta = \Gamma^* + (\Gamma^*)^T$,
where $\Gamma^*$ is a column-sparse matrix with entries only in columns indexed by $S$. This structure
can be enforced by the use of the column-sparse regularizer (12), as described in Example 5
to follow.
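
The algebra in Example 3 can be verified on simulated data. In the sketch below, `split_delta` is a hypothetical helper showing one way to build a column-sparse $\Gamma$ with $\Delta = \Gamma + \Gamma^T$: copy the columns of $\Delta$ indexed by $S$ and halve the $S \times S$ block, which would otherwise be counted twice. The sizes, seed, and corruption scale are assumptions.

```python
import numpy as np

def split_delta(Delta, S):
    """Return Gamma supported on the columns in S such that Delta = Gamma + Gamma^T."""
    Gamma = np.zeros_like(Delta)
    Gamma[:, S] = Delta[:, S]
    Gamma[np.ix_(S, S)] *= 0.5         # the S-by-S block appears in both Gamma and Gamma^T
    return Gamma

rng = np.random.default_rng(2)
d, n = 7, 50                           # hypothetical sizes
S = [1, 4]                             # hypothetical corrupted coordinates
U = rng.standard_normal((n, d))        # clean samples U_i, one per row
V = np.zeros((n, d))
V[:, S] = 5.0 * rng.standard_normal((n, len(S)))   # corruption supported on S

Delta = (V.T @ V + U.T @ V + V.T @ U) / n   # equation (4)
Y = (U + V).T @ (U + V) / n                 # sample covariance of corrupted data
Gamma = split_delta(Delta, S)
```

Here `Y` equals `U.T @ U / n + Delta`, so subtracting $\Theta^*$ from the first term gives the noise matrix $W$ of the example.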
2.2 Convex relaxation for noisy matrix decomposition

Given the observation model $Y = \mathfrak{X}(\Theta^* + \Gamma^*) + W$, it is natural to consider an estimator
based on solving the regularized least-squares program

$$\min_{(\Theta, \Gamma)} \Big\{ \frac{1}{2} |||Y - \mathfrak{X}(\Theta + \Gamma)|||_F^2 + \lambda_d |||\Theta|||_N + \mu_d \mathcal{R}(\Gamma) \Big\}.$$

Here $(\lambda_d, \mu_d)$ are non-negative regularizer parameters, to be chosen by the user. Our theory
also provides choices of these parameters that guarantee good properties of the associated
estimator. Although this estimator is reasonable, it turns out that an additional constraint
yields an equally simple estimator that has attractive properties, both in theory and in
practice.
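
To make the regularized least-squares program concrete, here is a minimal sketch for one special case: the identity observation operator and $\mathcal{R}$ taken to be the elementwise $\ell_1$-norm, one of the regularizers considered in this paper. It uses alternating exact minimization over the two blocks, which is a technique chosen for this illustration and not an algorithm described in the paper. The parameters `lam`, `mu` and `iters` are hypothetical and are not the theoretically prescribed choices.

```python
import numpy as np

def soft_threshold(A, t):
    """Entrywise soft-thresholding: the proximal map of t * (elementwise l1-norm)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def svd_threshold(A, t):
    """Soft-threshold the singular values: the proximal map of t * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def objective(Y, Theta, Gamma, lam, mu):
    """Least-squares fit plus nuclear-norm and l1 penalties (identity operator)."""
    fit = 0.5 * np.sum((Y - Theta - Gamma) ** 2)
    nuc = np.sum(np.linalg.svd(Theta, compute_uv=False))
    return float(fit + lam * nuc + mu * np.sum(np.abs(Gamma)))

def decompose(Y, lam, mu, iters=100):
    """Alternate exact minimization over Theta (Gamma fixed) and Gamma (Theta fixed)."""
    Theta, Gamma = np.zeros_like(Y), np.zeros_like(Y)
    history = []
    for _ in range(iters):
        Theta = svd_threshold(Y - Gamma, lam)    # minimizes over Theta
        Gamma = soft_threshold(Y - Theta, mu)    # minimizes over Gamma
        history.append(objective(Y, Theta, Gamma, lam, mu))
    return Theta, Gamma, history
```

Because each update minimizes the objective exactly over one block, the recorded objective values never increase. Handling a general operator $\mathfrak{X}$ or an additional constraint on $\Theta$ would require a different solver.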

In order to understand the need for an additional constraint, it should be noted that
without further constraints, the model (1) is unidentifiable, even in the noiseless setting
($W = 0$). Indeed, as has been discussed in past work [9, 7, 29], no method can recover the
components $(\Theta^*, \Gamma^*)$ unless the low-rank component is "incoherent" with the matrix $\Gamma^*$. For
instance, supposing for the moment that $\Gamma^*$ is a sparse matrix, consider a rank one matrix with
$\Theta^*_{11} \neq 0$, and zeros in all other positions. In this case, it is clearly impossible to disentangle
$\Theta^*$ from a sparse matrix. Past work on both matrix completion and decomposition [9, 7, 29]
has ruled out these types of troublesome cases via conditions on the singular vectors of the
low-rank component $\Theta^*$, and used them to derive sufficient conditions for exact recovery in
the noiseless setting (see the discussion following Example 4 for more details).

In this paper, we impose a related but milder condition, previously introduced in our past
work on matrix completion [20], with the goal of performing approximate recovery. To be
clear, this condition does not guarantee identifiability, but rather provides a bound on the
radius of non-identifiability. It should be noted that non-identifiability is a feature common
to many high-dimensional statistical models.¹ Moreover, in the more realistic setting of noisy
observations and/or matrices that are not exactly low-rank, such approximate recovery is the
best that can be expected. Indeed, one of our main contributions is to establish
minimax-optimality of our rates, meaning that no algorithm can be substantially better over the matrix
classes that we consider.

For a given regularizer $\mathcal{R}$, we define the quantity $\kappa_d(\mathcal{R}) := \sup_{V \neq 0} |||V|||_F / \mathcal{R}(V)$, which
measures the relation between the regularizer and the Frobenius norm. Moreover, we define
the associated dual norm

$$\mathcal{R}^*(U) := \sup_{\mathcal{R}(V) \leq 1} \langle\langle V, U \rangle\rangle, \tag{5}$$

where $\langle\langle V, U \rangle\rangle := \mathrm{trace}(V^T U)$ is the trace inner product on the space $\mathbb{R}^{d_1 \times d_2}$. Our estimators
are based on constraining the interaction between the low-rank component and $\Gamma^*$ via the
quantity

$$\varphi_{\mathcal{R}}(\Theta) := \kappa_d(\mathcal{R}^*)\, \mathcal{R}^*(\Theta). \tag{6}$$

More specifically, we analyze the family of estimators

$$\min_{(\Theta, \Gamma)} \Big\{ \frac{1}{2} |||Y - \mathfrak{X}(\Theta + \Gamma)|||_F^2 + \lambda_d |||\Theta|||_N + \mu_d \mathcal{R}(\Gamma) \Big\}, \tag{7}$$

subject to $\varphi_{\mathcal{R}}(\Theta) \leq \alpha$ for some fixed parameter $\alpha$.

¹ For instance, see the paper [23] for discussion of non-identifiability in high-dimensional sparse regression.
2.3 Some examples 

Let us consider some examples to provide intuition for specific forms of the estimator (7), and
the role of the additional constraint.

Example 4 (Sparsity and elementwise $\ell_1$-norm). Suppose that $\Gamma^*$ is assumed to be sparse,
with $s \ll d_1 d_2$ non-zero entries. In this case, the sum $\Theta^* + \Gamma^*$ corresponds to the sum of a
low rank matrix with a sparse matrix. Motivating applications include the problem of factor
analysis with a non-identity but sparse noise covariance, as discussed in Example 1, as well as
certain formulations of robust PCA [7], and model selection in Gauss-Markov random fields
with hidden variables [8]. Given the sparsity of $\Gamma^*$, an appropriate choice of regularizer is the
elementwise $\ell_1$-norm

$$\mathcal{R}(\Gamma) = \|\Gamma\|_1 := \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} |\Gamma_{jk}|. \tag{8}$$

With this choice, it is straightforward to verify that

$$\mathcal{R}^*(Z) = \|Z\|_\infty := \max_{j = 1, \ldots, d_1} \; \max_{k = 1, \ldots, d_2} |Z_{jk}|, \tag{9}$$

and moreover, that $\kappa_d(\mathcal{R}^*) = \sqrt{d_1 d_2}$. Consequently, in this specific case, the general convex
program (7) takes the form

$$\min_{(\Theta, \Gamma)} \Big\{ \frac{1}{2} |||Y - \mathfrak{X}(\Theta + \Gamma)|||_F^2 + \lambda_d |||\Theta|||_N + \mu_d \|\Gamma\|_1 \Big\} \quad \text{such that } \|\Theta\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}}. \tag{10}$$

The constraint involving $\|\Theta\|_\infty$ serves to control the "spikiness" of the low rank component,
with larger settings of $\alpha$ allowing for more spiky matrices. Indeed, this type of spikiness control
has proven useful in analysis of nuclear norm relaxations for noisy matrix completion [20]. To
gain intuition for the parameter $\alpha$, if we consider matrices with $|||\Theta|||_F \approx 1$, as is appropriate
to keep a constant signal-to-noise ratio in the noisy model (1), then setting $\alpha \approx 1$ allows only
for matrices for which $|\Theta_{jk}| \approx 1/\sqrt{d_1 d_2}$ in all entries. If we want to permit the maximally
spiky matrix with all its mass in a single position, then the parameter $\alpha$ must be of the order
$\sqrt{d_1 d_2}$. In practice, we are interested in settings of $\alpha$ lying between these two extremes.
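
The two extremes described above can be reproduced directly. The following sketch evaluates $\sqrt{d_1 d_2}\,\|\Theta\|_\infty$, the quantity that program (10) bounds by $\alpha$, for a perfectly flat matrix and for a single-entry spike, both of unit Frobenius norm. The dimensions are arbitrary illustrative values.

```python
import numpy as np

def spikiness(Theta):
    """sqrt(d1 * d2) * max |Theta_jk|, the quantity bounded by alpha in program (10)."""
    d1, d2 = Theta.shape
    return float(np.sqrt(d1 * d2) * np.max(np.abs(Theta)))

d1, d2 = 4, 9                                       # hypothetical dimensions
flat = np.full((d1, d2), 1.0 / np.sqrt(d1 * d2))    # unit Frobenius norm, evenly spread
spike = np.zeros((d1, d2))
spike[0, 0] = 1.0                                   # unit Frobenius norm, one entry
```

The flat matrix has spikiness 1 and the spike has spikiness $\sqrt{d_1 d_2}$, which is 6 for these dimensions.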


Past work on $\ell_1$-forms of matrix decomposition has imposed singular vector incoherence
conditions that are related to but different from our spikiness condition. More concretely,
write the SVD of the low-rank component as $\Theta^* = UDV^T$, where $D$ is diagonal, and
$U \in \mathbb{R}^{d_1 \times r}$ and $V \in \mathbb{R}^{d_2 \times r}$ are matrices of the left and right singular vectors. Singular vector
incoherence bounds quantities such as

$$\Big\| UU^T - \frac{r}{d_1} I_{d_1 \times d_1} \Big\|_\infty, \quad \Big\| VV^T - \frac{r}{d_2} I_{d_2 \times d_2} \Big\|_\infty, \quad \text{and} \quad \|UV^T\|_\infty, \tag{11}$$

all of which measure the degree of "coherence" between the singular vectors and the canonical
basis. A remarkable feature of such conditions is that they have no dependence on the
singular values of $\Theta^*$. This lack of dependence makes sense in the noiseless setting, where
exact recovery is the goal. For noisy models, in contrast, one should only be concerned
with recovering components with "large" singular values. In this context, our bound on the
maximum element $\|\Theta^*\|_\infty$, or equivalently on the quantity $\|UDV^T\|_\infty$, is natural. Note that
it imposes no constraint on the matrices $UU^T$ or $VV^T$, and moreover it uses the diagonal
matrix of singular values as a weight in the $\ell_\infty$ bound. Moreover, we note that there are many
matrices for which $\|\Theta^*\|_\infty$ satisfies a reasonable bound, whereas the incoherence measures are
poorly behaved (e.g., see Section 3.4.2 in the paper [20] for one example).

Example 5 (Column-sparsity and block columnwise regularization). Other applications
involve models in which $\Gamma^*$ has a relatively small number $s \ll d_2$ of non-zero columns (or a
relatively small number $s \ll d_1$ of non-zero rows). Such applications include the multi-task
regression problem from Example 2, the robust covariance problem from Example 3, as well
as a form of robust PCA considered by Xu et al. [29]. In this case, it is natural to constrain
$\Gamma$ via the (2,1)-norm regularizer

$$\mathcal{R}(\Gamma) = \|\Gamma\|_{2,1} := \sum_{k=1}^{d_2} \|\Gamma_k\|_2, \tag{12}$$

where $\Gamma_k$ is the $k$-th column of $\Gamma$ (or the (1,2)-norm regularizer that enforces the analogous
constraint on the rows of $\Gamma$). For this choice, it can be verified that

$$\mathcal{R}^*(U) = \|U\|_{2,\infty} := \max_{k = 1, 2, \ldots, d_2} \|U_k\|_2, \tag{13}$$

where $U_k$ denotes the $k$-th column of $U$, and that $\kappa_d(\mathcal{R}^*) = \sqrt{d_2}$. Consequently, in this specific
case, the general convex program (7) takes the form

$$\min_{(\Theta, \Gamma)} \Big\{ \frac{1}{2} |||Y - \mathfrak{X}(\Theta + \Gamma)|||_F^2 + \lambda_d |||\Theta|||_N + \mu_d \|\Gamma\|_{2,1} \Big\} \quad \text{such that } \|\Theta\|_{2,\infty} \leq \frac{\alpha}{\sqrt{d_2}}. \tag{14}$$

As before, the constraint on $\|\Theta\|_{2,\infty}$ serves to limit the "spikiness" of the low rank component,
where in this case, spikiness is measured in a columnwise manner. Again, it is natural to
consider matrices such that $|||\Theta^*|||_F \approx 1$, so that the signal-to-noise ratio in the observation
model (1) stays fixed. Thus, if $\alpha \approx 1$, then we are restricted to matrices for which
$\|\Theta^*_k\|_2 \approx 1/\sqrt{d_2}$ for all columns $k = 1, 2, \ldots, d_2$. At the other extreme, in order to permit a maximally
"column-spiky" matrix (i.e., with a single non-zero column of $\ell_2$-norm roughly 1), we need
to set $\alpha \approx \sqrt{d_2}$. As before, of practical interest are settings of $\alpha$ lying between these two
extremes.
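
The columnwise norms of Example 5 are equally easy to compute. The sketch below implements the (2,1)-norm, its dual the $(2,\infty)$-norm, and the columnwise spikiness $\sqrt{d_2}\,\|\Theta\|_{2,\infty}$ bounded by $\alpha$ in program (14). The function names are illustrative.

```python
import numpy as np

def norm_21(G):
    """(2,1)-norm: sum of the Euclidean norms of the columns."""
    return float(np.sum(np.linalg.norm(G, axis=0)))

def norm_2inf(U):
    """(2,inf)-norm: largest Euclidean column norm, the dual of the (2,1)-norm."""
    return float(np.max(np.linalg.norm(U, axis=0)))

def column_spikiness(Theta):
    """sqrt(d2) * ||Theta||_{2,inf}, the quantity bounded by alpha in program (14)."""
    return float(np.sqrt(Theta.shape[1]) * norm_2inf(Theta))
```

Duality gives the bound $\langle\langle V, U \rangle\rangle \leq \|V\|_{2,1}\, \|U\|_{2,\infty}$ for any pair of matrices. A unit-Frobenius matrix with equal column norms has columnwise spikiness 1, while one with a single non-zero column has columnwise spikiness $\sqrt{d_2}$.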

3 Main results and their consequences 

In this section, we state our main results, and discuss some of their consequences. Our
first result applies to the family of convex programs (7) whenever $\mathcal{R}$ belongs to the class of
decomposable regularizers, and the least-squares loss associated with the observation model
satisfies a specific form of restricted strong convexity [19]. Accordingly, we begin in Section 3.1
by defining the notion of decomposability, and then illustrating how the elementwise $\ell_1$- and
columnwise (2,1)-norms, as discussed in Examples 4 and 5 respectively, are both instances of
decomposable regularizers. In Section 3.2, we define the form of restricted strong convexity
appropriate to our setting. Section 3.3 contains the statement of our main result about the
M-estimator (7), while Sections 3.4 and 3.6 are devoted to its consequences for the cases of
elementwise sparsity and columnwise sparsity, respectively. In Section 3.5, we complement
our analysis of the convex program (7) by showing that, in the special case of the identity
operator, a simple two-step method can achieve similar rates (up to constant factors). We also
provide an example showing that the two-step method can fail for more general observation
operators. In Section 3.7, we state matching lower bounds on the minimax errors in the case
of the identity operator and Gaussian noise.

3.1 Decomposable regularizers 

The notion of decomposability is defined in terms of a pair of subspaces, which (in general)
need not be orthogonal complements. Here we consider a special case of decomposability that
is sufficient to cover the examples of interest in this paper:

Definition 1. Given a subspace $\mathbb{M} \subseteq \mathbb{R}^{d_1 \times d_2}$ and its orthogonal complement $\mathbb{M}^\perp$, a
norm-based regularizer $\mathcal{R}$ is decomposable with respect to $(\mathbb{M}, \mathbb{M}^\perp)$ if

$$\mathcal{R}(U + V) = \mathcal{R}(U) + \mathcal{R}(V) \quad \text{for all } U \in \mathbb{M} \text{ and } V \in \mathbb{M}^\perp. \tag{15}$$

To provide some intuition, the subspace $\mathbb{M}$ should be thought of as the nominal model
subspace; in our results, it will be chosen such that the matrix $\Gamma^*$ lies within or close to $\mathbb{M}$. The
orthogonal complement $\mathbb{M}^\perp$ represents deviations away from the model subspace, and the
equality (15) guarantees that such deviations are penalized as much as possible.

As discussed at more length in Negahban et al. [19], a large class of norms are
decomposable with respect to interesting subspace pairs. Of particular relevance to us is the
decomposability of the elementwise $\ell_1$-norm $\|\Gamma\|_1$ and the columnwise (2,1)-norm $\|\Gamma\|_{2,1}$, as
previously discussed in Examples 4 and 5 respectively.

Decomposability of $\mathcal{R}(\cdot) = \|\cdot\|_1$: Beginning with the elementwise $\ell_1$-norm, given an
arbitrary subset $S \subseteq \{1, 2, \ldots, d_1\} \times \{1, 2, \ldots, d_2\}$ of matrix indices, consider the subspace
pair

$$\mathbb{M}(S) := \{ U \in \mathbb{R}^{d_1 \times d_2} \mid U_{jk} = 0 \text{ for all } (j,k) \notin S \}, \quad \mathbb{M}^\perp(S) := (\mathbb{M}(S))^\perp. \tag{16}$$

It is then easy to see that for any pair $U \in \mathbb{M}(S)$, $U' \in \mathbb{M}^\perp(S)$, we have the splitting
$\|U + U'\|_1 = \|U\|_1 + \|U'\|_1$, showing that the elementwise $\ell_1$-norm is decomposable with
respect to the pair $(\mathbb{M}(S), \mathbb{M}^\perp(S))$.

Decomposability of $\mathcal{R}(\cdot) = \|\cdot\|_{2,1}$: Similarly, the columnwise (2,1)-norm is also
decomposable with respect to appropriately defined subspaces, indexed by subsets $C \subseteq \{1, 2, \ldots, d_2\}$
of column indices. Indeed, using $V_k$ to denote the $k$-th column of the matrix $V$, define

$$\mathbb{M}(C) := \{ V \in \mathbb{R}^{d_1 \times d_2} \mid V_k = 0 \text{ for all } k \notin C \}, \tag{17}$$

and $\mathbb{M}^\perp(C) := (\mathbb{M}(C))^\perp$. Again, it is easy to verify that for any pair $V \in \mathbb{M}(C)$, $V' \in \mathbb{M}^\perp(C)$,
we have $\|V + V'\|_{2,1} = \|V\|_{2,1} + \|V'\|_{2,1}$, thus verifying the decomposability property.
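
Both decomposability statements can be confirmed numerically on random matrices. In the sketch below, the index set (encoded as a boolean mask), the column subset, and the dimensions are arbitrary assumptions made for illustration.

```python
import numpy as np

def l1(A):
    """Elementwise l1-norm."""
    return float(np.sum(np.abs(A)))

def l21(A):
    """Columnwise (2,1)-norm."""
    return float(np.sum(np.linalg.norm(A, axis=0)))

rng = np.random.default_rng(3)
d1, d2 = 5, 6
A, B = rng.standard_normal((d1, d2)), rng.standard_normal((d1, d2))

# Elementwise case: U is supported on an index set S, U_perp on its complement.
mask = rng.random((d1, d2)) < 0.3
U, U_perp = A * mask, B * ~mask

# Columnwise case: V is supported on the columns in C, V_perp on the other columns.
cols = np.zeros(d2, dtype=bool)
cols[[0, 3]] = True
V, V_perp = A * cols, B * ~cols
```

For matrices with disjoint supports the norms add exactly, whereas for generic matrices such as `A` and `B` the triangle inequality is strict.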

For any decomposable regularizer and subspace $\mathbb{M} \neq \{0\}$, we define the compatibility
constant

$$\Psi(\mathbb{M}, \mathcal{R}) := \sup_{U \in \mathbb{M},\, U \neq 0} \frac{\mathcal{R}(U)}{|||U|||_F}. \tag{18}$$

This quantity measures the compatibility between the Frobenius norm and the regularizer
over the subspace $\mathbb{M}$. For example, for the $\ell_1$-norm and the set $\mathbb{M}(S)$ previously defined (16),
an elementary calculation yields $\Psi(\mathbb{M}(S); \|\cdot\|_1) = \sqrt{|S|}$.

² Note that any norm is (trivially) decomposable with respect to the pair $(\mathbb{M}, \mathbb{M}^\perp) = (\mathbb{R}^{d_1 \times d_2}, \{0\})$.
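
The value of the compatibility constant for the $\ell_1$-norm follows from the Cauchy-Schwarz inequality applied to the $|S|$ non-zero entries, with equality when all entries on $S$ have equal magnitude. The sketch below checks this on a randomly chosen index set; the sizes and the seed are hypothetical.

```python
import numpy as np

def ratio(U):
    """l1-norm of U divided by its Frobenius norm."""
    return float(np.sum(np.abs(U)) / np.sqrt(np.sum(U ** 2)))

rng = np.random.default_rng(4)
d1, d2, s = 6, 5, 7                                     # hypothetical sizes
support = rng.choice(d1 * d2, size=s, replace=False)    # hypothetical index set S

def embed(values):
    """Place s values on the support and zeros elsewhere."""
    U = np.zeros(d1 * d2)
    U[support] = values
    return U.reshape(d1, d2)

random_ratios = [ratio(embed(rng.standard_normal(s))) for _ in range(200)]
extremal = ratio(embed(np.ones(s)))    # equal magnitudes attain the supremum
```

Every random ratio stays below $\sqrt{s}$, and the constant-magnitude matrix attains it.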

3.2 Restricted strong convexity 

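Given a loss function, the general notion of strong convexity involves establishing a quadratic
lower bound on the error in the first-order Taylor approximation [B]. In our setting, the loss
is the quadratic function $\mathcal{L}(\Omega) = \frac{1}{2} |||Y - \mathfrak{X}(\Omega)|||_F^2$ (where we use $\Omega = \Theta + \Gamma$), so that the
first-order Taylor series error at $\Omega$ in the direction of the matrix $\Delta$ is given by

$$\mathcal{L}(\Omega + \Delta) - \mathcal{L}(\Omega) - \langle\langle \nabla \mathcal{L}(\Omega), \Delta \rangle\rangle = \frac{1}{2} |||\mathfrak{X}(\Delta)|||_F^2. \tag{19}$$

Consequently, strong convexity is equivalent to a lower bound of the form $\frac{1}{2} |||\mathfrak{X}(\Delta)|||_F^2 \geq \frac{\gamma}{2} |||\Delta|||_F^2$,
where $\gamma > 0$ is the strong convexity constant.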

Restricted strong convexity is a weaker condition that also involves a norm defined by the
regularizers. In our case, for any pair $(\mu_d, \lambda_d)$ of positive numbers, we first define the weighted
combination of the two regularizers, namely

$$Q(\Theta, \Gamma) := |||\Theta|||_N + \frac{\mu_d}{\lambda_d} \mathcal{R}(\Gamma). \tag{20}$$

For a given matrix $\Delta$, we can use this weighted combination to define an associated norm

$$\Phi(\Delta) := \inf_{\Theta + \Gamma = \Delta} Q(\Theta, \Gamma), \tag{21}$$

corresponding to the minimum value of $Q(\Theta, \Gamma)$ over all decompositions of $\Delta$.³

Definition 2 (RSC). The quadratic loss with linear operator $\mathfrak{X}: \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^{n_1 \times n_2}$ satisfies
restricted strong convexity with respect to the norm $\Phi$ and with parameters $(\gamma, \tau_n)$ if

$$\frac{1}{2} |||\mathfrak{X}(\Delta)|||_F^2 \geq \frac{\gamma}{2} |||\Delta|||_F^2 - \tau_n \Phi^2(\Delta) \quad \text{for all } \Delta \in \mathbb{R}^{d_1 \times d_2}. \tag{22}$$

Note that if condition (22) holds with $\tau_n = 0$ and any $\gamma > 0$, then we recover the usual
definition of strong convexity (with respect to the Frobenius norm). In the special case of the
identity operator (i.e., $\mathfrak{X}(\Theta) = \Theta$), such strong convexity does hold with $\gamma = 1$. More general
observation operators require different choices of the parameter $\gamma$, and also non-zero choices
of the tolerance parameter $\tau_n$.
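
The following sketch compares the curvature of the identity operator with that of a multi-task operator $\Delta \mapsto X\Delta$. For the latter it uses a standard linear-algebra fact that is not stated in the text above: when $X$ has at least as many rows as columns, $|||X\Delta|||_F^2 \geq \sigma_{\min}(X)^2\, |||\Delta|||_F^2$, so condition (22) holds with $\tau_n = 0$ and $\gamma = \sigma_{\min}(X)^2$. The design matrix and sizes are hypothetical.

```python
import numpy as np

def curvature_ratio(apply_op, Delta):
    """|||X(Delta)|||_F^2 / |||Delta|||_F^2, to be compared with gamma in (22)."""
    return float(np.sum(apply_op(Delta) ** 2) / np.sum(Delta ** 2))

rng = np.random.default_rng(5)
n, d1, d2 = 30, 6, 4                              # hypothetical sizes with n >= d1
X = rng.standard_normal((n, d1)) / np.sqrt(n)     # hypothetical design matrix
gamma_X = np.linalg.svd(X, compute_uv=False)[-1] ** 2   # smallest squared singular value

deltas = [rng.standard_normal((d1, d2)) for _ in range(100)]
identity_ratios = [curvature_ratio(lambda D: D, D) for D in deltas]
design_ratios = [curvature_ratio(lambda D: X @ D, D) for D in deltas]
```

When $n < d_1$ the matrix $X$ has a non-trivial null space, so no positive $\gamma$ works with $\tau_n = 0$; this is the regime in which a non-zero tolerance becomes necessary.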

While RSC establishes a form of (approximate) identifiability in general, here the error $\Delta$
is a combination of the error in estimating $\Theta^*$ (denoted $\Delta^\Theta$) and $\Gamma^*$ (denoted $\Delta^\Gamma$). Consequently, we will need
a further lower bound on $|||\Delta|||_F$ in terms of $|||\Delta^\Theta|||_F$ and $|||\Delta^\Gamma|||_F$ in the proof of our main results
to demonstrate the (approximate) identifiability of our model under the RSC condition.

³ Defined this way, $\Phi(\Delta)$ is the infimal-convolution of the two norms $|||\cdot|||_N$ and $\mathcal{R}$, which is a very well
studied object in convex analysis (see e.g. [26]).
3.3 Results for general regularizers and noise



We begin by stating a result for a general observation operator $\mathfrak{X}$, a general decomposable regularizer $\mathcal{R}$ and a general noise matrix $W$. In later subsections, we specialize this result to particular choices of observation operator, regularizers, and stochastic noise matrices. In all our results, we measure error using the squared Frobenius norm summed across both matrices

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) := |||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2. \tag{23}$$

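With this notation, the following result applies to the observation model $Y = \mathfrak{X}(\Theta^* + \Gamma^*) + W$, where the low-rank matrix satisfies the constraint $\varphi_{\mathcal{R}}(\Theta^*) \leq \alpha$. Our upper bound on the squared Frobenius error consists of three terms

$$\mathcal{K}_{\Theta^*} := \frac{\lambda_d^2}{\gamma^2} \Big\{ r + \frac{\gamma}{\lambda_d} \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \Big\}, \tag{24a}$$

$$\mathcal{K}_{\Gamma^*} := \frac{\mu_d^2}{\gamma^2} \Big\{ \Psi^2(\mathbb{M}; \mathcal{R}) + \frac{\gamma}{\mu_d} \mathcal{R}\big(\Pi_{\mathbb{M}^\perp}(\Gamma^*)\big) \Big\}, \tag{24b}$$

$$\mathcal{K}_{\tau_n} := \frac{\tau_n}{\gamma} \Big\{ \sum_{j=r+1}^{d} \sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d} \mathcal{R}\big(\Pi_{\mathbb{M}^\perp}(\Gamma^*)\big) \Big\}^2. \tag{24c}$$

As will be clarified shortly, these three terms correspond to the errors associated with the low-rank term ($\mathcal{K}_{\Theta^*}$), the sparse term ($\mathcal{K}_{\Gamma^*}$), and additional error ($\mathcal{K}_{\tau_n}$) associated with a non-zero tolerance $\tau_n$ in the RSC condition (22).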



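Theorem 1. Suppose that the observation operator $\mathfrak{X}$ satisfies the RSC condition (22) with curvature $\gamma > 0$, and a tolerance $\tau_n$ such that there exists an integer $r \in \{1, 2, \ldots, \min\{d_1, d_2\}\}$ for which

$$128\, \tau_n r < \frac{\gamma}{4}, \quad \text{and} \quad 64\, \tau_n \Big( \Psi(\mathbb{M}; \mathcal{R}) \frac{\mu_d}{\lambda_d} \Big)^2 < \frac{\gamma}{4}. \tag{25}$$

Then if we solve the convex program (7) with regularization parameters $(\lambda_d, \mu_d)$ satisfying

$$\lambda_d \geq 4 |||\mathfrak{X}^*(W)|||_{op}, \quad \text{and} \quad \mu_d \geq 4 \mathcal{R}^*(\mathfrak{X}^*(W)) + \frac{4 \gamma \alpha}{\kappa_d}, \tag{26}$$

there are universal constants $c_j$, $j = 1, 2, 3$, such that for any matrix pair $(\Theta^*, \Gamma^*)$ satisfying $\varphi_{\mathcal{R}}(\Theta^*) \leq \alpha$ and any $\mathcal{R}$-decomposable pair $(\mathbb{M}, \mathbb{M}^\perp)$, any optimal solution $(\widehat{\Theta}, \widehat{\Gamma})$ satisfies

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq c_1 \mathcal{K}_{\Theta^*} + c_2 \mathcal{K}_{\Gamma^*} + c_3 \mathcal{K}_{\tau_n}. \tag{27}$$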

Let us make a few remarks in order to interpret the meaning of this claim. 



Deterministic guarantee: To be clear, Theorem 1 is a deterministic statement that applies to any optimum of the convex program (7). Moreover, it actually provides a whole family of upper bounds, one for each choice of the rank parameter $r$ and each choice of the subspace pair $(\mathbb{M}, \mathbb{M}^\perp)$. In practice, these choices are optimized so as to obtain the tightest possible upper bound. As for the condition (25), it will be satisfied for a sufficiently large sample size $n$ as long as $\gamma > 0$, and the tolerance $\tau_n$ decreases to zero with the sample size. In many cases of interest, including the identity observation operator and multi-task cases, the RSC condition holds with $\tau_n = 0$, so that condition (25) holds as long as $\gamma > 0$.

Interpretation of different terms: Let us focus first on the term $\mathcal{K}_{\Theta^*}$, which corresponds to the complexity of estimating the low-rank component. It is further sub-divided into two terms, with the term $r$ corresponding to the estimation error associated with a rank $r$ matrix, whereas the term $\sum_{j=r+1}^{d} \sigma_j(\Theta^*)$ corresponds to the approximation error associated with representing $\Theta^*$ (which might be full rank) by a matrix of rank $r$. A similar interpretation applies to the two components associated with $\Gamma^*$, the first of which corresponds to a form of estimation error, whereas the second corresponds to a form of approximation error.



A family of upper bounds: Since the inequality (27) corresponds to a family of upper bounds indexed by $r$ and the subspace $\mathbb{M}$, these quantities can be chosen adaptively, depending on the structure of the matrices $(\Theta^*, \Gamma^*)$, so as to obtain the tightest possible upper bound. In the simplest case, the RSC conditions hold with tolerance $\tau_n = 0$, the matrix $\Theta^*$ is exactly low rank (say rank $r$), and $\Gamma^*$ lies within an $\mathcal{R}$-decomposable subspace $\mathbb{M}$. In this case, the approximation errors vanish, and Theorem 1 guarantees that the squared Frobenius error is at most

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \lesssim \lambda_d^2 r + \mu_d^2 \Psi^2(\mathbb{M}; \mathcal{R}), \tag{28}$$

where the $\lesssim$ notation indicates that we ignore constant factors.

3.4 Results for $\ell_1$-norm regularization

Theorem 1 holds for any regularizer that is decomposable with respect to some subspace pair. As previously noted, an important example of a decomposable regularizer is the elementwise $\ell_1$-norm, which is decomposable with respect to subspaces of the form (16).



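Corollary 1. Consider an observation operator $\mathfrak{X}$ that satisfies the RSC condition (22) with $\gamma > 0$ and $\tau_n = 0$. Suppose that we solve the convex program (10) with regularization parameters $(\lambda_d, \mu_d)$ such that

$$\lambda_d \geq 4 |||\mathfrak{X}^*(W)|||_{op}, \quad \text{and} \quad \mu_d \geq 4 \|\mathfrak{X}^*(W)\|_\infty + \frac{4 \gamma \alpha}{\sqrt{d_1 d_2}}. \tag{29}$$

Then there are universal constants $c_j$ such that for any matrix pair $(\Theta^*, \Gamma^*)$ with $\|\Theta^*\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}}$ and for all integers $r = 1, 2, \ldots, \min\{d_1, d_2\}$, and $s = 1, 2, \ldots, d_1 d_2$, we have

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq c_1 \frac{\lambda_d^2}{\gamma^2} \Big\{ r + \frac{\gamma}{\lambda_d} \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \Big\} + c_2 \frac{\mu_d^2}{\gamma^2} \Big\{ s + \frac{\gamma}{\mu_d} \sum_{(j,k) \notin S} |\Gamma^*_{jk}| \Big\}, \tag{30}$$

where $S$ is an arbitrary subset of matrix indices of cardinality at most $s$.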



Remarks: This result follows directly by specializing Theorem 1 to the elementwise $\ell_1$-norm. As noted in Example 4 for this norm, we have $\kappa_d = \sqrt{d_1 d_2}$, so that the choice (29) satisfies the conditions of Theorem 1. The dual norm is given by the elementwise $\ell_\infty$-norm $\mathcal{R}^*(\cdot) = \|\cdot\|_\infty$. As observed in Section 3.1, the $\ell_1$-norm is decomposable with respect to subspace pairs of the form $(\mathbb{M}(S), \mathbb{M}^\perp(S))$, for an arbitrary subset $S$ of matrix indices. Moreover, for any subset $S$ of cardinality $s$, we have $\Psi^2(\mathbb{M}(S)) = s$. It is easy to verify that with this choice, we have $\|\Pi_{\mathbb{M}^\perp}(\Gamma^*)\|_1 = \sum_{(j,k) \notin S} |\Gamma^*_{jk}|$, from which the claim follows.

It is worth noting that the inequality (27) corresponds to a family of upper bounds indexed by $r$ and the subset $S$. For any fixed integer $s \in \{1, 2, \ldots, d_1 d_2\}$, it is natural to let $S$ index the largest $s$ values (in absolute value) of $\Gamma^*$. Moreover, the choice of the pair $(r, s)$ can be further adapted to the structure of the matrix. For instance, when $\Theta^*$ is exactly low rank, and $\Gamma^*$ is exactly sparse, then one natural choice is $r = \operatorname{rank}(\Theta^*)$, and $s = |\operatorname{supp}(\Gamma^*)|$. With this choice, both the approximation terms vanish, and Corollary 1 guarantees that any solution $(\widehat{\Theta}, \widehat{\Gamma})$ of the convex program (10) satisfies

$$|||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2 \lesssim \lambda_d^2 r + \mu_d^2 s. \tag{31}$$
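As an illustration of how the two approximation terms in Corollary 1 behave, the following minimal sketch evaluates the right-hand side of the bound, $\frac{\lambda_d^2}{\gamma^2}\{r + \frac{\gamma}{\lambda_d}\sum_{j>r}\sigma_j(\Theta^*)\} + \frac{\mu_d^2}{\gamma^2}\{s + \frac{\gamma}{\mu_d}\sum_{(j,k)\notin S}|\Gamma^*_{jk}|\}$, with $S$ indexing the $s$ largest entries of $\Gamma^*$ in absolute value. The function names and the choice of setting the universal constants to one are assumptions made for illustration only.

```python
import numpy as np

def singular_value_tail(theta, r):
    """Sum of singular values sigma_j(theta) for j > r (low-rank approximation error)."""
    sv = np.linalg.svd(theta, compute_uv=False)  # returned in descending order
    return float(sv[r:].sum())

def l1_tail(gamma, s):
    """Sum of |gamma_jk| outside the s largest-magnitude entries (sparse approximation error)."""
    mags = np.sort(np.abs(gamma).ravel())[::-1]  # magnitudes in descending order
    return float(mags[s:].sum())

def corollary1_bound(theta, gamma, r, s, lam, mu, curvature=1.0):
    """Right-hand side of the Corollary 1 bound, with universal constants set to 1 (assumption)."""
    low_rank = (lam / curvature) ** 2 * (r + (curvature / lam) * singular_value_tail(theta, r))
    sparse = (mu / curvature) ** 2 * (s + (curvature / mu) * l1_tail(gamma, s))
    return low_rank + sparse

# Toy matrices (hypothetical): theta has singular values 3, 2, 1; gamma has three non-zeros.
theta = np.diag([3.0, 2.0, 1.0])
gamma = np.array([[1.0, -4.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
```

Scanning such a function over a grid of $(r, s)$ is one way to see the trade-off between estimation and approximation error; when $r = \operatorname{rank}(\Theta^*)$ and $s = |\operatorname{supp}(\Gamma^*)|$, both tails vanish.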



Further specializing to the case of noiseless observations ($W = 0$) yields a form of approximate recovery, namely

$$|||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2 \lesssim \frac{\alpha^2 s}{d_1 d_2}. \tag{32}$$

This guarantee is weaker than the exact recovery results obtained in past work on the noiseless observation model with identity operator; however, these papers imposed incoherence requirements on the singular vectors of the low-rank component $\Theta^*$ that are more restrictive than the conditions of Theorem 1.

Our elementwise $\ell_\infty$ bound is a weaker condition than incoherence, since it allows for singular vectors to be coherent as long as the associated singular value is not too large. Moreover, the bound (32) is unimprovable up to constant factors, due to the non-identifiability of the observation model (1), as shown by the following example for the identity observation operator $\mathfrak{X} = I$.

Example 6. [Unimprovability for elementwise sparse model] Consider a given sparsity index $s \in \{1, 2, \ldots, d_1 d_2\}$, where we may assume without loss of generality that $s \leq d_2$. We then form the matrix

$$\Theta^* := \frac{\alpha}{\sqrt{d_1 d_2}} \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 & 0 & \cdots & 0 \end{bmatrix}}_{f^T}, \tag{33}$$

where the vector $f \in \mathbb{R}^{d_2}$ has exactly $s$ ones. Note that $\|\Theta^*\|_\infty = \frac{\alpha}{\sqrt{d_1 d_2}}$ by construction, and moreover $\Theta^*$ is rank one, and has $s$ non-zero entries. Since up to $s$ entries of the noise matrix $\Gamma^*$ can be chosen arbitrarily, "nature" can always set $\Gamma^* = -\Theta^*$, meaning that we would observe $Y = \Theta^* + \Gamma^* = 0$. Consequently, based on observing only $Y$, the pair $(\Theta^*, \Gamma^*)$ is indistinguishable from the all-zero matrices $(0_{d_1 \times d_2}, 0_{d_1 \times d_2})$. This fact can be used to show that no method can have squared Frobenius error lower than order $\frac{\alpha^2 s}{d_1 d_2}$; see Section 3.7 for a precise statement. Therefore, the bound (32) cannot be improved unless one is willing to impose further restrictions on the pair $(\Theta^*, \Gamma^*)$. We note that the singular vector incoherence conditions, as imposed in past work (e.g., [9]) and used to guarantee exact recovery, would exclude the matrix (33), since its left singular vector is the unit vector $e_1 \in \mathbb{R}^{d_1}$.
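The construction in Example 6 is easy to check numerically. The sketch below builds $\Theta^* = \frac{\alpha}{\sqrt{d_1 d_2}} e_1 f^T$, sets $\Gamma^* = -\Theta^*$, and confirms that the observation is identically zero while $|||\Theta^*|||_F^2 = \alpha^2 s/(d_1 d_2)$. The helper name `spiky_rank_one` and the particular dimensions are hypothetical choices for illustration.

```python
import numpy as np

def spiky_rank_one(d1, d2, s, alpha):
    """Rank-one matrix (alpha / sqrt(d1*d2)) * e_1 f^T, where f has exactly s ones."""
    e1 = np.zeros(d1)
    e1[0] = 1.0
    f = np.zeros(d2)
    f[:s] = 1.0
    return (alpha / np.sqrt(d1 * d2)) * np.outer(e1, f)

d1, d2, s, alpha = 20, 30, 5, 2.0           # illustrative values only
theta = spiky_rank_one(d1, d2, s, alpha)
gamma = -theta                               # adversarial sparse component
Y = theta + gamma                            # noiseless observation with identity operator

# The all-zero estimate is consistent with Y, yet its squared error is 2 * alpha^2 * s / (d1 * d2).
zero_estimate_error = np.sum(theta ** 2) + np.sum(gamma ** 2)
```

Any estimator that sees only $Y = 0$ cannot tell this pair apart from the pair of zero matrices, which is the source of the $\alpha^2 s/(d_1 d_2)$ term.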
3.4.1 Results for stochastic noise matrices

Our discussion thus far has applied to general observation operators $\mathfrak{X}$, and general noise matrices $W$. More concrete results can be obtained by assuming particular forms of $\mathfrak{X}$, and that the noise matrix $W$ is stochastic. Our first stochastic result applies to the identity operator $\mathfrak{X} = I$ and a noise matrix $W$ generated with i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries.

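Corollary 2. Suppose $\mathfrak{X} = I$, the matrix $\Theta^*$ has rank at most $r$ and satisfies $\|\Theta^*\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}}$, and $\Gamma^*$ has at most $s$ non-zero entries. If the noise matrix $W$ has i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries, and we solve the convex program (10) with regularization parameters

$$\lambda_d = \frac{8\nu}{\sqrt{d_1}} + \frac{8\nu}{\sqrt{d_2}}, \quad \text{and} \quad \mu_d = 16 \nu \sqrt{\frac{\log(d_1 d_2)}{d_1 d_2}} + \frac{4\alpha}{\sqrt{d_1 d_2}}, \tag{34}$$

then with probability greater than $1 - \exp(-2 \log(d_1 d_2))$, any optimal solution $(\widehat{\Theta}, \widehat{\Gamma})$ satisfies

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq \underbrace{c_1 \nu^2 \Big( \frac{r (d_1 + d_2)}{d_1 d_2} \Big)}_{\mathcal{K}_{\Theta^*}} + \underbrace{c_1 \nu^2 \Big( \frac{s \log(d_1 d_2)}{d_1 d_2} \Big) + c_1 \frac{\alpha^2 s}{d_1 d_2}}_{\mathcal{K}_{\Gamma^*}}. \tag{35}$$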

Remarks: In the statement of this corollary, the settings of $\lambda_d$ and $\mu_d$ are based on upper bounding $\|W\|_\infty$ and $|||W|||_{op}$, using large deviation bounds and some non-asymptotic random matrix theory. With a slightly modified argument, the bound (35) can be sharpened slightly by reducing the logarithmic term to $\log(\frac{d_1 d_2}{s})$. As shown in Theorem 2 to follow in Section 3.7, this sharpened bound is minimax-optimal, meaning that no estimator (regardless of its computational complexity) can achieve much better estimates for the matrix classes and noise model given here.
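To make the parameter choices of Corollary 2 concrete, the following sketch computes $\lambda_d = 8\nu/\sqrt{d_1} + 8\nu/\sqrt{d_2}$ and $\mu_d = 16\nu\sqrt{\log(d_1 d_2)/(d_1 d_2)} + 4\alpha/\sqrt{d_1 d_2}$, together with the three rate terms of the bound (35) up to the unspecified constant $c_1$. The function names are hypothetical, and in practice $\nu$ and $\alpha$ are typically unknown and would have to be estimated or tuned.

```python
import math

def corollary2_parameters(d1, d2, nu, alpha):
    """Regularization parameters (lambda_d, mu_d) as specified in Corollary 2."""
    lam = 8.0 * nu / math.sqrt(d1) + 8.0 * nu / math.sqrt(d2)
    mu = 16.0 * nu * math.sqrt(math.log(d1 * d2) / (d1 * d2)) + 4.0 * alpha / math.sqrt(d1 * d2)
    return lam, mu

def corollary2_rate_terms(d1, d2, r, s, nu, alpha):
    """Low-rank, sparse, and non-identifiability terms of the bound, without constants."""
    low_rank = nu ** 2 * r * (d1 + d2) / (d1 * d2)
    sparse = nu ** 2 * s * math.log(d1 * d2) / (d1 * d2)
    non_identifiable = alpha ** 2 * s / (d1 * d2)
    return low_rank, sparse, non_identifiable

lam, mu = corollary2_parameters(100, 100, nu=1.0, alpha=1.0)
terms = corollary2_rate_terms(100, 100, r=5, s=50, nu=1.0, alpha=1.0)
```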

It is also worth observing that both terms in the bound (35) have intuitive interpretations. Considering first the term $\mathcal{K}_{\Theta^*}$, we note that the numerator term $r(d_1 + d_2)$ is of the order of the number of free parameters in a rank $r$ matrix of dimensions $d_1 \times d_2$. The multiplicative factor $\frac{\nu^2}{d_1 d_2}$ corresponds to the noise variance in the problem. On the other hand, the term $\mathcal{K}_{\Gamma^*}$ measures the complexity of estimating $s$ non-zero entries in a $d_1 \times d_2$ matrix. Note that there are $\binom{d_1 d_2}{s}$ possible subsets of size $s$, and consequently, the numerator includes a term that scales as $\log \binom{d_1 d_2}{s} \approx s \log(d_1 d_2)$. As before, the multiplicative pre-factor $\frac{\nu^2}{d_1 d_2}$ corresponds to the noise variance. Finally, the second term within $\mathcal{K}_{\Gamma^*}$, namely the quantity $\frac{\alpha^2 s}{d_1 d_2}$, arises from the non-identifiability of the model, and as discussed in Example 6, it cannot be avoided without imposing further restrictions on the pair $(\Gamma^*, \Theta^*)$.

We now turn to analysis of the sparse factor analysis problem: as previously introduced in Example 1, this involves estimation of a covariance matrix that has a low-rank plus elementwise sparse decomposition. In this case, given $n$ i.i.d. samples from a distribution with unknown covariance matrix $\Sigma = \Theta^* + \Gamma^*$, the noise matrix $W \in \mathbb{R}^{d \times d}$ is a recentered Wishart noise (see equation (3)). We can use tail bounds for its entries and its operator norm in order to specify appropriate choices of the regularization parameters $\lambda_d$ and $\mu_d$. We summarize our conclusions in the following corollary:



To be clear, we state our results in terms of the noise scaling $\nu^2/(d_1 d_2)$ since it corresponds to a model with constant signal-to-noise ratio when the Frobenius norms of $\Theta^*$ and $\Gamma^*$ remain bounded, independently of the dimension. The same results would hold if the noise were not rescaled, modulo the appropriate rescalings of the various terms.

Corollary 3. Consider the factor analysis model with $n \geq d$ samples, and regularization parameters

$$\lambda_d = 16 |||\sqrt{\Sigma}|||_2 \sqrt{\frac{d}{n}}, \quad \text{and} \quad \mu_d = 32 \rho(\Sigma) \sqrt{\frac{\log d}{n}} + \frac{4\alpha}{d}, \quad \text{where } \rho(\Sigma) = \max_j \Sigma_{jj}. \tag{36}$$

Then with probability greater than $1 - c_2 \exp(-c_3 \log(d))$, any optimal solution $(\widehat{\Theta}, \widehat{\Gamma})$ satisfies

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq c_1 \Big\{ |||\Sigma|||_2 \frac{r d}{n} + \rho(\Sigma) \frac{s \log d}{n} \Big\} + c_1 \frac{\alpha^2 s}{d^2}.$$



We note that the condition $n \geq d$ is necessary to obtain consistent estimates in factor analysis models, even in the case with $\Gamma^* = I_{d \times d}$ where PCA is possible (e.g., see Johnstone [15]). Again, the terms in the bound have a natural interpretation: since a matrix of rank $r$ in $d$ dimensions has roughly $rd$ degrees of freedom, we expect to see a term of the order $\frac{rd}{n}$. Similarly, since the number of subsets of size $s$ in a $d \times d$ matrix satisfies $\log \binom{d^2}{s} \approx s \log d$, we also expect to see a term of the order $\frac{s \log d}{n}$. Moreover, although we have stated our choices of regularization parameter in terms of $|||\Sigma|||_2$ and $\rho(\Sigma)$, these can be replaced by the analogous versions using the sample covariance matrix $\widehat{\Sigma}$. (By the concentration results that we establish, the population and empirical versions do not differ significantly when $n \geq d$.)



3.4.2 Comparison to Hsu et al. [14]

This recent work focuses on the problem of matrix decomposition with the $\|\cdot\|_1$-norm, and provides results both for the noiseless and noisy setting. All of their work focuses on the case of exactly low rank and exactly sparse matrices, and deals only with the identity observation operator; in contrast, Theorem 1 in this paper provides an upper bound for general matrix pairs and observation operators. Most relevant is comparison of our $\ell_1$-results with exact rank-sparsity constraints to their Theorem 3, which provides various error bounds (in nuclear and Frobenius norm) for such models with additive noise. These bounds are obtained using an estimator similar to our program (10), and in parts of their analysis, they enforce bounds on the $\ell_\infty$-norm of the solution. However, this is not done directly with a constraint on $\Theta$ as in our estimator, but rather by penalizing the difference $\|Y - \Gamma\|_\infty$, or by thresholding the solution.

Apart from these minor differences, there are two major differences between our results and those of Hsu et al. First of all, their analysis involves three quantities $(\alpha, \beta, \gamma)$ that measure singular vector incoherence, and must satisfy a number of inequalities. In contrast, our analysis is based only on a single condition: the "spikiness" condition on the low-rank component $\Theta^*$. As we have seen, this constraint is weaker than singular vector incoherence, and consequently, unlike the result of Hsu et al., we do not provide exact recovery guarantees for the noiseless setting. However, it is interesting to see (as shown by our analysis) that a very simple spikiness condition suffices for the approximate recovery guarantees that are of interest for noisy observation models. Given these differing assumptions, the underlying proof techniques are quite distinct, with our methods leveraging the notion of restricted strong convexity introduced by Negahban et al. [19].

The second (and perhaps most significant) difference is in the sharpness of the results for the noisy setting, and the permissible scalings of the rank-sparsity pair $(r, s)$. As will be clarified in Section 3.7, the rates that we establish for low-rank plus elementwise sparsity for the noisy Gaussian model (Corollary 2) are minimax-optimal up to constant factors. In contrast, the upper bounds in Theorem 3 of Hsu et al. involve the product $rs$, and hence are sub-optimal as the rank and sparsity scale. These terms appear only additively in both our upper and minimax lower bounds, showing that an upper bound involving the product $rs$ is sub-optimal. Moreover, the bounds of Hsu et al. (see Section IV.D) are limited to matrix decompositions for which the rank-sparsity pair $(r, s)$ is bounded as

$$rs \lesssim \frac{d_1 d_2}{\log(d_1) \log(d_2)}. \tag{37}$$

This bound precludes many scalings that are of interest. For instance, if the sparse component $\Gamma^*$ has a nearly constant fraction of non-zeros (say $s \asymp \frac{d_1 d_2}{\log(d_1) \log(d_2)}$ for concreteness), then the bound (37) restricts $\Theta^*$ to have constant rank. In contrast, our analysis allows for high-dimensional scaling of both the rank $r$ and sparsity $s$ simultaneously; as can be seen by inspection of Corollary 2, our Frobenius norm error goes to zero under the scalings $s \asymp \frac{d_1 d_2}{\log(d_1) \log(d_2)}$ and $r \asymp \frac{d_2}{\log(d_2)}$.

3.4.3 Results for multi-task regression

Let us now extend our results to the setting of multi-task regression, as introduced in Example 2. The observation model is of the form $Y = XB^* + W$, where $X \in \mathbb{R}^{n \times d_1}$ is a known design matrix, and we observe the matrix $Y \in \mathbb{R}^{n \times d_2}$. Our goal is to estimate the regression matrix $B^* \in \mathbb{R}^{d_1 \times d_2}$, which is assumed to have a decomposition of the form $B^* = \Theta^* + \Gamma^*$, where $\Theta^*$ models the shared characteristics between each of the tasks, and the matrix $\Gamma^*$ models perturbations away from the shared structure. If we take $\Gamma^*$ to be a sparse matrix, an appropriate choice of regularizer $\mathcal{R}$ is the elementwise $\ell_1$-norm, as in Corollary 2. We use $\sigma_{\min}$ and $\sigma_{\max}$ to denote the minimum and maximum singular values (respectively) of the rescaled design matrix $X/\sqrt{n}$; we assume that $X$ is invertible so that $\sigma_{\min} > 0$, and moreover, that its columns are uniformly bounded in $\ell_2$-norm, meaning that $\max_{j=1,\ldots,d_1} \|X_j\|_2 \leq \kappa_{\max} \sqrt{n}$. We note that these assumptions are satisfied for many common examples of random design.

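Corollary 4. Suppose that the matrix $\Theta^*$ has rank at most $r$ and satisfies $\|\Theta^*\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}}$, and the matrix $\Gamma^*$ has at most $s$ non-zero entries. If the entries of $W$ are i.i.d. $N(0, \nu^2)$, and we solve the convex program (10) with regularization parameters

$$\lambda_d = 8 \nu \sigma_{\max} \sqrt{n} (\sqrt{d_1} + \sqrt{d_2}), \quad \text{and} \quad \mu_d = 16 \nu \kappa_{\max} \sqrt{n \log(d_1 d_2)} + \frac{4 \alpha \sigma_{\min}^2 n}{\sqrt{d_1 d_2}}, \tag{38}$$

then with probability greater than $1 - \exp(-2 \log(d_1 d_2))$, any optimal solution $(\widehat{\Theta}, \widehat{\Gamma})$ satisfies

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq \underbrace{c_1 \frac{\nu^2 \sigma_{\max}^2}{\sigma_{\min}^4} \Big( \frac{r (d_1 + d_2)}{n} \Big)}_{\mathcal{K}_{\Theta^*}} + \underbrace{c_2 \Big[ \frac{\nu^2 \kappa_{\max}^2}{\sigma_{\min}^4} \Big( \frac{s \log(d_1 d_2)}{n} \Big) + \frac{\alpha^2 s}{d_1 d_2} \Big]}_{\mathcal{K}_{\Gamma^*}}. \tag{39}$$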

Remarks: We see that the results presented above are analogous to those presented in Corollary 2. However, in this setting, we leverage large deviations results in order to find bounds on $\|\mathfrak{X}^*(W)\|_\infty$ and $|||\mathfrak{X}^*(W)|||_{op}$ that hold with high probability given our observation model.

3.5 An alternative two-step method



As suggested by one reviewer, it is possible that a simpler two-step method (namely, one based on first thresholding the entries of the observation matrix $Y$, and then performing a low-rank approximation) might achieve similar rates to the more complex convex relaxation (10). In this section, we provide a detailed analysis of one version of such a procedure in the case of nuclear norm combined with $\ell_1$-regularization. We prove that in the special case of $\mathfrak{X} = I$, this procedure can attain the same form of error bounds, with possibly different constants. However, there is also a cautionary message here: we also give an example to show that the two-step method will not necessarily perform well for general observation operators $\mathfrak{X}$.

In detail, let us consider the following two-step estimator:

(a) Estimate the sparse component $\Gamma^*$ by solving

$$\widehat{\Gamma} \in \arg\min_{\Gamma \in \mathbb{R}^{d_1 \times d_2}} \Big\{ \frac{1}{2} |||Y - \Gamma|||_F^2 + \mu_d \|\Gamma\|_1 \Big\}. \tag{40}$$

As is well-known, this convex program has an explicit solution based on soft-thresholding the entries of $Y$.

(b) Given the estimate $\widehat{\Gamma}$, estimate the low-rank component $\Theta^*$ by solving the convex program

$$\widehat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \Big\{ \frac{1}{2} |||Y - \Theta - \widehat{\Gamma}|||_F^2 + \lambda_d |||\Theta|||_N \Big\}. \tag{41}$$

Interestingly, note that this method can be understood as the first two steps of a blockwise co-ordinate descent method for solving the convex program (10). In step (a), we fix the low-rank component, and minimize as a function of the sparse component. In step (b), we fix the sparse component, and then minimize as a function of the low-rank component. The following result shows that these two steps of co-ordinate descent achieve the same rates (up to constant factors) as solving the full convex program (10):

Proposition 1. Given observations $Y$ from the model $Y = \Theta^* + \Gamma^* + W$ with $\|\Theta^*\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}}$, consider the two-step procedure (40) and (41) with regularization parameters $(\lambda_d, \mu_d)$ such that

$$\lambda_d \geq 4 |||W|||_{op}, \quad \text{and} \quad \mu_d \geq 4 \|W\|_\infty + \frac{4\alpha}{\sqrt{d_1 d_2}}. \tag{42}$$

Then the error bound (30) from Corollary 1 holds with $\gamma = 1$.

Consequently, in the special case that $\mathfrak{X} = I$, there is no need to solve the convex program (10) to optimality; rather, two steps of co-ordinate descent are sufficient.
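Both steps of the two-step procedure admit closed-form solutions, which the following minimal sketch implements for the identity observation operator: step (a) is entrywise soft-thresholding of $Y$ at level $\mu_d$, and step (b) is soft-thresholding of the singular values of $Y - \widehat{\Gamma}$ at level $\lambda_d$ (the standard proximal operator of the nuclear norm). The function names are hypothetical, and the sketch is an illustration rather than the implementation used for the experiments reported later.

```python
import numpy as np

def soft_threshold(Y, mu):
    """Entrywise soft-thresholding: minimizer of 0.5*||Y - G||_F^2 + mu*||G||_1 over G."""
    return np.sign(Y) * np.maximum(np.abs(Y) - mu, 0.0)

def singular_value_threshold(M, lam):
    """Minimizer of 0.5*||M - T||_F^2 + lam*||T||_nuclear over T (shrink singular values)."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sv - lam, 0.0)) @ Vt

def two_step_estimate(Y, lam, mu):
    """Step (a): sparse estimate by soft-thresholding Y; step (b): low-rank fit to the residual."""
    gamma_hat = soft_threshold(Y, mu)
    theta_hat = singular_value_threshold(Y - gamma_hat, lam)
    return theta_hat, gamma_hat

# Tiny illustrative input (hypothetical values).
Y = np.array([[3.0, 0.5], [-2.0, 0.0]])
theta_hat, gamma_hat = two_step_estimate(Y, lam=0.0, mu=1.0)
```

With $\lambda_d = 0$ the low-rank step simply returns the residual $Y - \widehat{\Gamma}$, and for $\lambda_d$ larger than the top singular value of the residual it returns the zero matrix.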

On the other hand, the simple two-stage method will not work for general observation operators $\mathfrak{X}$. As shown in the proof of Proposition 1, the two-step method relies critically on having the quantity $\|\mathfrak{X}(\Theta^* + W)\|_\infty$ be upper bounded (up to constant factors) by $\max\{\|\Theta^*\|_\infty, \|W\|_\infty\}$. By triangle inequality, this condition holds trivially when $\mathfrak{X} = I$, but can be violated by other choices of the observation operator, as illustrated by the following example.

Example 7 (Failure of two-step method). Recall the multi-task observation model first introduced in Example 2. In Corollary 4, we showed that the general estimator (10) will recover good estimates under certain assumptions on the observation matrix. In this example, we provide an instance for which the assumptions of Corollary 4 are satisfied, but on the other hand, the two-step method will not return a good estimate.

More specifically, let us consider the observation model $Y = X(\Theta^* + \Gamma^*) + W$, in which $Y \in \mathbb{R}^{d \times d}$, and the observation matrix $X \in \mathbb{R}^{d \times d}$ takes the form

$$X := I_{d \times d} + \frac{1}{\sqrt{d}} e_1 \vec{1}^T,$$

where $e_1 \in \mathbb{R}^d$ is the standard basis vector with a 1 in the first component, and $\vec{1}$ denotes the vector of all ones. Suppose that the unknown low-rank matrix is given by $\Theta^* = \frac{1}{d} \vec{1} \vec{1}^T$. Note that this matrix has rank one, and satisfies $\|\Theta^*\|_\infty = \frac{1}{d}$.

We now verify that the conditions of Corollary 4 are satisfied. Letting $\sigma_{\min}$ and $\sigma_{\max}$ denote (respectively) the smallest and largest singular values of $X$, we have $\sigma_{\min} = 1$ and $\sigma_{\max} \leq 2$. Moreover, letting $X_j$ denote the $j$-th column of $X$, we have $\max_{j=1,\ldots,d} \|X_j\|_2 \leq 2$. Consequently, if we consider rescaled observations with noise variance $\nu^2/d$, the conditions of Corollary 4 are all satisfied with constants (independent of dimension), so that the M-estimator (10) will have good performance.

On the other hand, letting $\mathbb{E}$ denote expectation over any zero-mean noise matrix $W$, we have

$$\mathbb{E}\big[ \|X(\Theta^* + W)\|_\infty \big] \overset{(i)}{\geq} \|X(\Theta^* + \mathbb{E}[W])\|_\infty = \|X(\Theta^*)\|_\infty \overset{(ii)}{\geq} \sqrt{d}\, \|\Theta^*\|_\infty,$$

where step (i) exploits Jensen's inequality, and step (ii) uses the fact that

$$\|X(\Theta^*)\|_\infty = \frac{1}{d} + \frac{1}{\sqrt{d}} = (1 + \sqrt{d}) \|\Theta^*\|_\infty.$$

For any noise matrix $W$ with reasonable tail behavior, the variable $\|X(\Theta^* + W)\|_\infty$ will concentrate around its expectation, showing that $\|X(\Theta^* + W)\|_\infty$ will be larger than $\|\Theta^*\|_\infty$ by an order of magnitude (factor of $\sqrt{d}$). Consequently, the two-step method will have much larger estimation error in this case.
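The blow-up factor in Example 7 can be checked numerically. The sketch below assumes the observation matrix has the form $X = I + \frac{1}{\sqrt{d}} e_1 \vec{1}^T$ and that $\Theta^* = \frac{1}{d} \vec{1}\vec{1}^T$, which is the reading consistent with the identity $\|X(\Theta^*)\|_\infty = \frac{1}{d} + \frac{1}{\sqrt{d}}$ used in step (ii); the function name is hypothetical.

```python
import numpy as np

def example7_ratio(d):
    """Ratio ||X Theta*||_inf / ||Theta*||_inf for X = I + e_1 1^T / sqrt(d), Theta* = 1 1^T / d."""
    ones = np.ones(d)
    e1 = np.zeros(d)
    e1[0] = 1.0
    X = np.eye(d) + np.outer(e1, ones) / np.sqrt(d)   # assumed form of the observation matrix
    theta = np.outer(ones, ones) / d                   # rank-one matrix with max entry 1/d
    return np.max(np.abs(X @ theta)) / np.max(np.abs(theta))

ratios = {d: example7_ratio(d) for d in (4, 25, 100)}
```

Under these assumptions the ratio equals $1 + \sqrt{d}$, so the entrywise magnitude of the observed signal grows with the dimension even though $\|\Theta^*\|_\infty$ shrinks, which is what defeats thresholding in step (a).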



3.6 Results for $\|\cdot\|_{2,1}$ regularization

Let us return again to the general Theorem 1 and illustrate some more of its consequences in application to the columnwise $(2,1)$-norm previously defined in Example 5, and methods based on solving the convex program (14). As before, specializing Theorem 1 to this decomposable regularizer yields a number of guarantees. In order to keep our presentation relatively brief, we focus here on the case of the identity observation operator $\mathfrak{X} = I$.

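Corollary 5. Suppose that we solve the convex program (14) with regularization parameters $(\lambda_d, \mu_d)$ such that

$$\lambda_d \geq 4 |||W|||_{op}, \quad \text{and} \quad \mu_d \geq 4 \|W\|_{2,\infty} + \frac{4\alpha}{\sqrt{d_2}}. \tag{43}$$

Then there is a universal constant $c_1$ such that for any matrix pair $(\Theta^*, \Gamma^*)$ with $\|\Theta^*\|_{2,\infty} \leq \frac{\alpha}{\sqrt{d_2}}$ and for all integers $r = 1, 2, \ldots, d$ and $s = 1, 2, \ldots, d_2$, we have

$$|||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2 \leq c_1 \lambda_d^2 \Big\{ r + \frac{1}{\lambda_d} \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \Big\} + c_1 \mu_d^2 \Big\{ s + \frac{1}{\mu_d} \sum_{k \notin C} \|\Gamma^*_k\|_2 \Big\}, \tag{44}$$

where $C \subseteq \{1, 2, \ldots, d_2\}$ is an arbitrary subset of column indices of cardinality at most $s$.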

Remarks: This result follows directly by specializing Theorem 1 to the columnwise $(2,1)$-norm and identity observation model, previously discussed in Example 5. Its dual norm is the columnwise $(2,\infty)$-norm, and we have $\kappa_d = \sqrt{d_2}$. As discussed in Section 3.1, the $(2,1)$-norm is decomposable with respect to subspaces of the type $\mathbb{M}(C)$, as defined in equation (17), where $C \subseteq \{1, 2, \ldots, d_2\}$ is an arbitrary subset of column indices. For any such subset $C$ of cardinality $s$, it can be calculated that $\Psi^2(\mathbb{M}(C)) = s$, and moreover, that $\|\Pi_{\mathbb{M}^\perp}(\Gamma^*)\|_{2,1} = \sum_{k \notin C} \|\Gamma^*_k\|_2$. Consequently, the bound (44) follows from Theorem 1.
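The quantities appearing in these remarks are straightforward to compute. The following sketch evaluates the columnwise $(2,1)$-norm, its dual $(2,\infty)$-norm, and the column tail $\sum_{k \notin C} \|\Gamma^*_k\|_2$ when $C$ indexes the $s$ columns of largest $\ell_2$-norm. The function names are hypothetical, and the check $|||V|||_F \leq \sqrt{d_2}\, \|V\|_{2,\infty}$ is offered only as a sanity check consistent with $\kappa_d = \sqrt{d_2}$.

```python
import numpy as np

def norm_2_1(M):
    """Columnwise (2,1)-norm: sum of the Euclidean norms of the columns."""
    return float(np.linalg.norm(M, axis=0).sum())

def norm_2_inf(M):
    """Columnwise (2,inf)-norm: largest Euclidean norm among the columns."""
    return float(np.linalg.norm(M, axis=0).max())

def column_tail(gamma, s):
    """Sum of column norms outside the s columns of largest Euclidean norm."""
    col_norms = np.sort(np.linalg.norm(gamma, axis=0))  # ascending order
    keep = max(len(col_norms) - s, 0)                   # number of columns outside C
    return float(col_norms[:keep].sum())

M = np.array([[3.0, 0.0], [4.0, 1.0]])   # column norms are 5 and 1
```

For this toy matrix the $(2,1)$-norm is 6, the $(2,\infty)$-norm is 5, and keeping the single largest column leaves a tail of 1.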

As before, if we assume that $\Theta^*$ has exactly rank $r$ and $\Gamma^*$ has at most $s$ non-zero columns, then both approximation error terms in the bound (44) vanish, and we recover an upper bound of the form $|||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2 \lesssim \lambda_d^2 r + \mu_d^2 s$. If we further specialize to the case of exact observations ($W = 0$), then Corollary 5 guarantees that

$$|||\widehat{\Theta} - \Theta^*|||_F^2 + |||\widehat{\Gamma} - \Gamma^*|||_F^2 \lesssim \frac{\alpha^2 s}{d_2}.$$

The following example shows that, given our conditions, even in the noiseless setting, no method can recover the matrices to precision more accurate than $\alpha^2 s/d_2$.



Example 8 (Unimprovability for columnwise sparse model). In order to demonstrate that the term $\alpha^2 s/d_2$ is unavoidable, it suffices to consider a slight modification of Example 6. In particular, let us define the matrix

$$\Theta^* := \frac{\alpha}{\sqrt{d_1 d_2}} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 & 0 & \cdots & 0 \end{bmatrix}}_{f^T}, \tag{45}$$

where again the vector $f \in \mathbb{R}^{d_2}$ has $s$ non-zeros. Note that the matrix $\Theta^*$ is rank one, has $s$ non-zero columns, and moreover $\|\Theta^*\|_{2,\infty} = \frac{\alpha}{\sqrt{d_2}}$. Consequently, the matrix $\Theta^*$ is covered by Corollary 5. Since $s$ columns of the matrix $\Gamma^*$ can be chosen in an arbitrary manner, it is possible that $\Gamma^* = -\Theta^*$, in which case the observation matrix $Y = 0$. This fact can be exploited to show that no method can achieve squared Frobenius error much smaller than order $\frac{\alpha^2 s}{d_2}$; see Section 3.7 for the precise statement. Finally, we note that it is difficult to compare directly to the results of Xu et al. [29], since their results do not guarantee exact recovery of the pair $(\Theta^*, \Gamma^*)$.



As with the case of elementwise $\ell_1$-norm, more concrete results can be obtained when the noise matrix $W$ is stochastic.



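Corollary 6. Suppose $\Theta^*$ has rank at most $r$ and satisfies $\|\Theta^*\|_{2,\infty} \leq \frac{\alpha}{\sqrt{d_2}}$, and $\Gamma^*$ has at most $s$ non-zero columns. If the noise matrix $W$ has i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries, and we solve the convex program (14) with regularization parameters $\lambda_d = \frac{8\nu}{\sqrt{d_1}} + \frac{8\nu}{\sqrt{d_2}}$ and

$$\mu_d = 8\nu \Big( \sqrt{\frac{1}{d_2}} + \sqrt{\frac{\log d_2}{d_1 d_2}} \Big) + \frac{4\alpha}{\sqrt{d_2}},$$

then with probability greater than $1 - \exp(-2 \log(d_2))$, any optimal solution $(\widehat{\Theta}, \widehat{\Gamma})$ satisfies

$$e^2(\widehat{\Theta}, \widehat{\Gamma}) \leq \underbrace{c_1 \nu^2 \frac{r (d_1 + d_2)}{d_1 d_2}}_{\mathcal{K}_{\Theta^*}} + \underbrace{c_1 \nu^2 \Big\{ \frac{s d_1}{d_1 d_2} + \frac{s \log d_2}{d_1 d_2} \Big\}}_{\mathcal{K}_{\Gamma^*}} + c_1 \frac{\alpha^2 s}{d_2}. \tag{46}$$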




Remarks: Note that the setting of $\lambda_d$ is the same as in Corollary 2, whereas the parameter $\mu_d$ is chosen based on upper bounding $\|W\|_{2,\infty}$, corresponding to the dual norm of the columnwise $(2,1)$-norm. With a slightly modified argument, the bound (46) can be sharpened slightly by reducing the logarithmic term to $\log(\frac{d_2}{s})$. As shown in Theorem 2 to follow in Section 3.7, this sharpened bound is minimax-optimal.

As with Corollary 2, both terms in the bound (46) are readily interpreted. The term $\mathcal{K}_{\Theta^*}$ has the same interpretation, as a combination of the number of degrees of freedom in a rank $r$ matrix (that is, of the order $r(d_1 + d_2)$) scaled by the noise variance $\frac{\nu^2}{d_1 d_2}$. The second term $\mathcal{K}_{\Gamma^*}$ has a somewhat more subtle interpretation. The problem of estimating $s$ non-zero columns embedded within a $d_1 \times d_2$ matrix can be split into two sub-problems: first, the problem of estimating the $s d_1$ non-zero parameters (in Frobenius norm), and second, the problem of column subset selection, i.e., determining the location of the $s$ non-zero parameters. The estimation sub-problem yields the term $\frac{\nu^2 s d_1}{d_1 d_2}$, whereas the column subset selection sub-problem incurs a penalty involving $\log \binom{d_2}{s} \approx s \log d_2$, multiplied by the usual noise variance. The final term $\alpha^2 s/d_2$ arises from the non-identifiability of the model. As discussed in Example 8, it is unavoidable without further restrictions.



We now turn to some consequences for the problem of robust covariance estimation formulated in Example 3. As seen from the model stated there, the disturbance matrix in this setting can be written as a sum $(\Gamma^*)^T + \Gamma^*$, where $\Gamma^*$ is a column-wise sparse matrix. Consequently, we can use a variant of the estimator (14), in which the loss function is given by $|||Y - \{\Theta + \Gamma^T + \Gamma\}|||_F^2$. The following result summarizes the consequences of Theorem 1 in this setting:

Corollary 7. Consider the problem of robust covariance estimation with $n \geq d$ samples, based on a matrix $\Theta^*$ with rank at most $r$ that satisfies $\|\Theta^*\|_{2,\infty} \leq \frac{\alpha}{\sqrt{d}}$, and a corrupting matrix $\Gamma^*$ with at most $s$ rows and columns corrupted. If we solve SDP (14) with regularization parameters



Some comments about this result: with the motivation of being concrete, we have given an explicit choice (47) of the regularization parameters, involving the operator norm $|||\Theta^*|||_{op}$, but any upper bound would suffice. As with the noise variance in Corollary 6, a typical strategy would choose this pre-factor by cross-validation.

3.7 Lower bounds



For the case of i.i.d. Gaussian noise matrices, Corollaries 2 and 6 provide results of an achievable nature, namely in guaranteeing that our estimators achieve certain Frobenius errors. In this section, we turn to the complementary question: what are the fundamental (algorithm-independent) limits of accuracy in noisy matrix decomposition? One way in which to address such a question is by analyzing statistical minimax rates.

More formally, given some family $\mathcal{F}$ of matrices, the associated minimax error is given by

$$\mathfrak{M}(\mathcal{F}) := \inf_{(\widetilde{\Theta}, \widetilde{\Gamma})} \sup_{(\Theta^*, \Gamma^*) \in \mathcal{F}} \mathbb{E}\big[ |||\widetilde{\Theta} - \Theta^*|||_F^2 + |||\widetilde{\Gamma} - \Gamma^*|||_F^2 \big], \tag{48}$$

where the infimum ranges over all estimators $(\widetilde{\Theta}, \widetilde{\Gamma})$ that are (measurable) functions of the data $Y$, and the supremum ranges over all pairs $(\Theta^*, \Gamma^*) \in \mathcal{F}$. Here the expectation is taken over the Gaussian noise matrix $W \in \mathbb{R}^{d_1 \times d_2}$ under the linear observation model (1).

Given a matrix $\Gamma^*$, we define its support set $\operatorname{supp}(\Gamma^*) := \{(j, k) \mid \Gamma^*_{jk} \neq 0\}$, as well as its column support $\operatorname{colsupp}(\Gamma^*) := \{k \mid \Gamma^*_k \neq 0\}$, where $\Gamma^*_k$ denotes the $k$-th column. Using this notation, our interest centers on the following two matrix families:

$$\mathcal{F}_{sp}(r, s, \alpha) := \Big\{ (\Theta^*, \Gamma^*) \mid \operatorname{rank}(\Theta^*) \leq r, \; |\operatorname{supp}(\Gamma^*)| \leq s, \; \|\Theta^*\|_\infty \leq \frac{\alpha}{\sqrt{d_1 d_2}} \Big\}, \quad \text{and} \tag{49a}$$

$$\mathcal{F}_{col}(r, s, \alpha) := \Big\{ (\Theta^*, \Gamma^*) \mid \operatorname{rank}(\Theta^*) \leq r, \; |\operatorname{colsupp}(\Gamma^*)| \leq s, \; \|\Theta^*\|_{2,\infty} \leq \frac{\alpha}{\sqrt{d_2}} \Big\}. \tag{49b}$$

By construction, Corollaries 2 and 6 apply to the families $\mathcal{F}_{sp}$ and $\mathcal{F}_{col}$ respectively.

The following theorem establishes lower bounds on the minimax risks (in squared Frobenius norm) over these two families for the identity observation operator:

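Theorem 2. Consider the linear observation model (1) with identity observation operator: $\mathfrak{X}(\Theta + \Gamma) = \Theta + \Gamma$. There is a universal constant $c_0 > 0$ such that for all $\alpha \geq 32 \sqrt{\log(d_1 d_2)}$, we have

$$\mathfrak{M}(\mathcal{F}_{sp}(r, s, \alpha)) \geq c_0 \nu^2 \Big\{ \frac{r (d_1 + d_2)}{d_1 d_2} + \frac{s \log\big(\frac{d_1 d_2 - s}{s/2}\big)}{d_1 d_2} \Big\} + c_0 \frac{\alpha^2 s}{d_1 d_2}, \tag{50}$$

and

$$\mathfrak{M}(\mathcal{F}_{col}(r, s, \alpha)) \geq c_0 \nu^2 \Big( \frac{r (d_1 + d_2)}{d_1 d_2} + \frac{s}{d_2} + \frac{s \log\big(\frac{d_2 - s}{s/2}\big)}{d_1 d_2} \Big) + c_0 \frac{\alpha^2 s}{d_2}. \tag{51}$$

Note the agreement with the achievable rates guaranteed in Corollaries 2 and 6 respectively. (As discussed in the remarks following these corollaries, the sharpened forms of the logarithmic factors follow by a more careful analysis.) Theorem 2 shows that in terms of squared Frobenius error, the convex relaxations (10) and (14) are minimax optimal up to constant factors.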

In addition, it is worth observing that although Theorem 2 is stated in the context of additive Gaussian noise, it also shows that the radius of non-identifiability (involving the parameter $\alpha$) is a fundamental limit. In particular, by setting the noise variance to zero, we see that under our milder conditions, even in the noiseless setting, no algorithm can estimate to greater accuracy than $c_0 \frac{\alpha^2 s}{d_1 d_2}$, or the analogous quantity for column-sparse matrices.

4 Simulation results



We have implemented the M-estimators based on the convex programs (10) and (14), in
particular by adapting first-order optimization methods due to Nesterov [22]. In this section,
we report simulation results that demonstrate the excellent agreement between our theoretical
predictions and the behavior in practice. In all cases, we used square matrices ($d = d_1 = d_2$),
and a stochastic noise matrix $W$ with i.i.d. $N(0, \nu^2/d^2)$ entries, with $\nu^2 = 1$. For any given rank
$r$, we generated $\Theta^*$ by randomly choosing the spaces of left and right singular vectors. We
formed random sparse (elementwise or columnwise) matrices by choosing the positions of the
non-zeros (entries or columns) uniformly at random.
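The simulation set-up just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_instance`, the distribution of the singular values, and the distribution of the non-zero values of the sparse component are all assumptions, since the text only specifies random singular subspaces, uniformly random non-zero positions, and $N(0, \nu^2/d^2)$ noise.

```python
import numpy as np

def make_instance(d, r, s, nu=1.0, columnwise=False, seed=0):
    """Draw (Theta*, Gamma*, W, Y) for a d x d problem with identity observations."""
    rng = np.random.default_rng(seed)
    # Random r-dimensional left/right singular subspaces via QR of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    # Assumption: singular values drawn uniformly from [0.5, 1].
    sing = rng.uniform(0.5, 1.0, size=r)
    theta = (U * sing) @ V.T
    gamma = np.zeros((d, d))
    if columnwise:
        # s non-zero columns, positions chosen uniformly at random.
        cols = rng.choice(d, size=s, replace=False)
        gamma[:, cols] = rng.standard_normal((d, s))
    else:
        # s non-zero entries, positions chosen uniformly at random.
        flat = rng.choice(d * d, size=s, replace=False)
        rows, cols = np.unravel_index(flat, (d, d))
        gamma[rows, cols] = rng.standard_normal(s)
    # Noise with i.i.d. N(0, nu^2 / d^2) entries.
    W = (nu / d) * rng.standard_normal((d, d))
    return theta, gamma, W, theta + gamma + W
```

For example, `make_instance(d=100, r=10, s=2171)` mimics the dimensions used in the first experiment.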

Recall the estimator (10) from Example 4. It is based on a combination of the nuclear
norm with the elementwise $\ell_1$-norm, and is motivated by the problem of recovering a low-rank matrix
$\Theta^*$ corrupted by an arbitrary sparse matrix $\Gamma^*$. In our first set of experiments, we fixed the
matrix dimension $d = 100$, and then studied a range of ranks $r$ for $\Theta^*$, as well as a range of
sparsity indices $s$ for $\Gamma^*$. More specifically, we studied linear scalings of the form $r = \gamma d$ for
a constant $\gamma \in (0, 1)$, and $s = \beta d^2$ for a second constant $\beta \in (0, 1)$.

Note that under this scaling, Corollary 2 predicts that the squared Frobenius error should
be upper bounded as $c_1 \gamma + c_2 \beta \log(1/\beta)$, for some universal constants $c_1, c_2$. Figure 1(a)
provides experimental confirmation of the accuracy of these theoretical predictions: varying $\gamma$
(with $\beta$ fixed) produces linear growth of the squared error as a function of $\gamma$. In Figure 1(b),
we study the complementary scaling, with the rank ratio $\gamma$ fixed and the sparsity ratio $\beta$
varying in the interval $[.01, .1]$. Since $\beta \log(1/\beta) \approx \Theta(\beta)$ over this interval, we should expect
to see roughly linear scaling. Again, the plot shows good agreement with the theoretical
predictions.



[Figure 1 panel title: "Error versus rank"]

Figure 1. Behavior of the estimator (10). (a) Plot of the squared Frobenius error $e^2(\widehat{\Theta}, \widehat{\Gamma})$
versus the rank ratio $\gamma \in \{0.05 : 0.05 : 0.50\}$, for matrices of size $100 \times 100$ and $s = 2171$
corrupted entries. The growth of the squared error is linear in $\gamma$, as predicted by the theory.
(b) Plot of the squared Frobenius error $e^2(\widehat{\Theta}, \widehat{\Gamma})$ versus the sparsity parameter $\beta \in [0.01, 0.1]$
for matrices of size $100 \times 100$ and rank $r = 10$. Consistent with the theory, the squared error
scales approximately linearly in $\beta$ in a neighborhood around zero.
[Figure 2 panel titles: (a) "Squared error versus d"; (b) "Inverse squared error versus d"]

Figure 2. Behavior of the estimator (14). (a) Plot of the squared Frobenius error $e^2(\widehat{\Theta}, \widehat{\Gamma})$
versus the dimension $d \in \{100 : 25 : 300\}$, for two different choices of the rank ($r = 10$ and
$r = 15$). (b) Plot of the inverse squared Frobenius error versus the dimension, confirming the
linear scaling in $d$ predicted by theory. In addition, the curve for $r = 15$ requires a matrix
dimension that is 3/2 times larger to reach the same error as the curve for $r = 10$, consistent
with theory.



Now recall the estimator (14) from Example 5, designed for estimating a low-rank matrix
plus a columnwise sparse matrix. We have observed similar linear dependence on the analogs
of the parameters $\gamma$ and $\beta$, as predicted by our theory. In the interests of exhibiting a different
phenomenon, here we report its performance for matrices of varying dimension, in all cases
with $\Gamma^*$ having $s = 3r$ non-zero columns. Figure 2(a) shows plots of squared Frobenius
error versus the dimension for two choices of the rank ($r = 10$ and $r = 15$), and the matrix
dimension varying in the range $d \in \{100 : 25 : 300\}$. As predicted by our theory, these plots
decrease at the rate $1/d$. Indeed, this scaling is revealed by replotting the inverse squared
error versus $d$, which produces the roughly linear plots shown in panel (b). Moreover, by
comparing the relative slopes of these two curves, we see that the problem with rank $r = 15$
requires a dimension that is roughly 3/2 times larger than the problem with $r = 10$ to achieve
the same error. Again, this linear scaling in rank is consistent with Corollary 6.

5 Proofs 

In this section, we provide the proofs of our main results, with the proofs of some more 
technical lemmas deferred to the appendices. 

5.1 Proof of Theorem 1

For the reader's convenience, let us recall here the two assumptions on the regularization
parameters:

$$\mu_d \ge 4\,\mathcal{R}^*(\mathfrak{X}^*(W)) + \frac{4\gamma\alpha}{\kappa_d} > 0, \quad\text{and}\quad \lambda_d \ge 4\,|||\mathfrak{X}^*(W)|||_{op}. \tag{52}$$

Furthermore, so as to simplify notation, let us define the error matrices $\Delta^\Theta := \widehat{\Theta} - \Theta^*$
and $\Delta^\Gamma := \widehat{\Gamma} - \Gamma^*$. Let $(\mathbb{M}, \mathbb{M}^\perp)$ denote an arbitrary subspace pair for which the regularizer
$\mathcal{R}$ is decomposable. Throughout these proofs, we adopt the convenient shorthand notation
$\Delta^\Gamma_{\mathbb{M}} := \Pi_{\mathbb{M}}(\Delta^\Gamma)$ and $\Delta^\Gamma_{\mathbb{M}^\perp} = \Pi_{\mathbb{M}^\perp}(\Delta^\Gamma)$, with similar definitions for $\Gamma^*_{\mathbb{M}}$ and $\Gamma^*_{\mathbb{M}^\perp}$.

We now turn to a lemma that deals with the behavior of the error matrices $(\Delta^\Theta, \Delta^\Gamma)$ when
measured together using a weighted sum of the nuclear norm and regularizer $\mathcal{R}$. In order to
state the following lemma, let us recall that for any positive $(\mu_d, \lambda_d)$, the weighted norm is
defined as $\mathcal{Q}(\Theta, \Gamma) := |||\Theta|||_N + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma)$.

With this notation, we have the following: 

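Lemma 1. For any $r = 1, 2, \ldots, d$, there is a decomposition $\Delta^\Theta = \Delta^\Theta_A + \Delta^\Theta_B$ such that:

(a) The decomposition satisfies

$$\mathrm{rank}(\Delta^\Theta_A) \le 2r, \quad\text{and}\quad (\Delta^\Theta_A)^T \Delta^\Theta_B = (\Delta^\Theta_B)^T \Delta^\Theta_A = 0.$$

(b) The difference $\mathcal{Q}(\Theta^*, \Gamma^*) - \mathcal{Q}(\Theta^* + \Delta^\Theta, \Gamma^* + \Delta^\Gamma)$ is upper bounded by

$$\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) - \mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp}) + 2\sum_{j=r+1}^{d} \sigma_j(\Theta^*) + \frac{2\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}). \tag{53}$$

(c) Under conditions (52) on $\mu_d$ and $\lambda_d$, the error matrices $\Delta^\Theta$ and $\Delta^\Gamma$ satisfy

$$\mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp}) \le 3\,\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 4\Big\{ \sum_{j=r+1}^{d} \sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}) \Big\} \tag{54}$$

for any $\mathcal{R}$-decomposable pair $(\mathbb{M}, \mathbb{M}^\perp)$.

See Appendix A for the proof of this result.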

Our second lemma guarantees that the cost function $\mathcal{L}(\Theta, \Gamma) = \frac{1}{2}|||Y - \mathfrak{X}(\Theta + \Gamma)|||_F^2$ is
strongly convex in a restricted set of directions. In particular, if we let $\delta\mathcal{L}(\Delta^\Theta, \Delta^\Gamma)$ denote
the error in the first-order Taylor series expansion around $(\Theta^*, \Gamma^*)$, then some algebra shows
that

$$\delta\mathcal{L}(\Delta^\Theta, \Delta^\Gamma) = \frac{1}{2}|||\mathfrak{X}(\Delta^\Theta + \Delta^\Gamma)|||_F^2. \tag{55}$$

The following lemma shows that (up to a slack term) this Taylor error is lower bounded by
the squared Frobenius norm.

Lemma 2 (Restricted strong convexity). Under the conditions of Theorem 1, the first-order
Taylor series error (55) is lower bounded by

$$\frac{\gamma}{4}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) - \frac{\lambda_d}{2}\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) - 16\tau_n\Big\{ \sum_{j=r+1}^{d} \sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}) \Big\}^2. \tag{56}$$

We prove this result in Appendix B.



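Using Lemmas 1 and 2, we can now complete the proof of Theorem 1. By the optimality
of $(\widehat{\Theta}, \widehat{\Gamma})$ and the feasibility of $(\Theta^*, \Gamma^*)$, we have

$$\frac{1}{2}|||Y - \mathfrak{X}(\widehat{\Theta} + \widehat{\Gamma})|||_F^2 + \lambda_d|||\widehat{\Theta}|||_N + \mu_d\mathcal{R}(\widehat{\Gamma}) \le \frac{1}{2}|||Y - \mathfrak{X}(\Theta^* + \Gamma^*)|||_F^2 + \lambda_d|||\Theta^*|||_N + \mu_d\mathcal{R}(\Gamma^*).$$

Recalling that $Y = \mathfrak{X}(\Theta^* + \Gamma^*) + W$, and re-arranging in terms of the errors $\Delta^\Theta = \widehat{\Theta} - \Theta^*$
and $\Delta^\Gamma = \widehat{\Gamma} - \Gamma^*$, we obtain

$$\frac{1}{2}|||\mathfrak{X}(\Delta^\Theta + \Delta^\Gamma)|||_F^2 \le \langle\!\langle \Delta^\Theta + \Delta^\Gamma, \mathfrak{X}^*(W) \rangle\!\rangle + \lambda_d\mathcal{Q}(\Theta^*, \Gamma^*) - \lambda_d\mathcal{Q}(\Theta^* + \Delta^\Theta, \Gamma^* + \Delta^\Gamma),$$

where the weighted norm $\mathcal{Q}$ was previously defined (20).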

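We now substitute inequality (53) from Lemma 1 into the right-hand side of the above
equation to obtain

$$\frac{1}{2}|||\mathfrak{X}(\Delta^\Theta + \Delta^\Gamma)|||_F^2 \le \langle\!\langle \Delta^\Theta + \Delta^\Gamma, \mathfrak{X}^*(W) \rangle\!\rangle + \lambda_d\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) - \lambda_d\mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp}) + 2\lambda_d\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + 2\mu_d\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}).$$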

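Some algebra and an application of Hölder's inequality and the triangle inequality allows us
to obtain the upper bound

$$\big(|||\Delta^\Theta_A|||_N + |||\Delta^\Theta_B|||_N\big)\,|||\mathfrak{X}^*(W)|||_{op} + \big(\mathcal{R}(\Delta^\Gamma_{\mathbb{M}}) + \mathcal{R}(\Delta^\Gamma_{\mathbb{M}^\perp})\big)\,\mathcal{R}^*(\mathfrak{X}^*(W)) - \lambda_d\mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp}) + \lambda_d\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 2\lambda_d\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + 2\mu_d\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}).$$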

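Recalling conditions (52) for $\mu_d$ and $\lambda_d$, we obtain the inequality

$$\frac{1}{2}|||\mathfrak{X}(\Delta^\Theta + \Delta^\Gamma)|||_F^2 \le \frac{3\lambda_d}{2}\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 2\lambda_d\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + 2\mu_d\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}).$$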

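Using inequality (56) from Lemma 2 to lower bound the left-hand side, and then rearranging
terms yields

$$\frac{\gamma}{4}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) \le \frac{3\lambda_d}{2}\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + \frac{\lambda_d}{2}\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) + 16\tau_n\Big\{\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp})\Big\}^2 + 2\lambda_d\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + 2\mu_d\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp}). \tag{57}$$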

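Now note that by the triangle inequality $\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) \le \mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + \mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp})$, so that
combined with the bound (54) from Lemma 1, we obtain

$$\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) \le 4\,\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 4\Big\{\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp})\Big\}.$$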
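Substituting this upper bound into equation (57) yields

$$\frac{\gamma}{4}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) \le 4\lambda_d\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 16\tau_n\Big\{\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp})\Big\}^2 + 4\Big\{\lambda_d\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + \mu_d\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp})\Big\}. \tag{58}$$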

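Noting that $\Delta^\Theta_A$ has rank at most $2r$ and that $\Delta^\Gamma_{\mathbb{M}}$ lies in the model space $\mathbb{M}$, we find that

$$\lambda_d\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) \le \sqrt{2r}\,\lambda_d|||\Delta^\Theta_A|||_F + \Psi(\mathbb{M})\,\mu_d|||\Delta^\Gamma_{\mathbb{M}}|||_F \le \sqrt{2r}\,\lambda_d|||\Delta^\Theta|||_F + \Psi(\mathbb{M})\,\mu_d|||\Delta^\Gamma|||_F.$$

Substituting the above inequality into equation (58) and rearranging the terms involving
$e^2(\Delta^\Theta, \Delta^\Gamma)$ yields the claim.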
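The last step relies on two norm comparisons: a matrix of rank at most $2r$ has nuclear norm at most $\sqrt{2r}$ times its Frobenius norm, and a matrix in the model subspace has regularizer at most $\Psi(\mathbb{M})$ times its Frobenius norm. The sketch below checks both numerically; the helper names are hypothetical, and the second check uses the elementwise $\ell_1$-norm on an $s$-sparse matrix (where the compatibility constant is $\sqrt{s}$ by Cauchy-Schwarz) as one assumed instance of the general regularizer.

```python
import numpy as np

def nuclear_vs_frobenius(A, rank_bound):
    """Return (|||A|||_N, sqrt(rank_bound) * |||A|||_F)."""
    return np.linalg.norm(A, "nuc"), np.sqrt(rank_bound) * np.linalg.norm(A, "fro")

def l1_vs_frobenius(G):
    """Return (||G||_1, sqrt(s) * |||G|||_F), with s the number of non-zeros of G."""
    s = np.count_nonzero(G)
    return np.abs(G).sum(), np.sqrt(s) * np.linalg.norm(G, "fro")

rng = np.random.default_rng(0)
d, r = 30, 3
# Product of d x 2r and 2r x d factors, so the rank is at most 2r.
A = rng.standard_normal((d, 2 * r)) @ rng.standard_normal((2 * r, d))
# A sparse matrix with at most 12 non-zero entries.
G = np.zeros((d, d))
G[rng.integers(0, d, 12), rng.integers(0, d, 12)] = rng.standard_normal(12)
```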

5.2 Proof of Corollaries 2 and 4

Note that Corollary 2 can be viewed as a special case of Corollary 4, in which $n = d_1$ and
$X = I_{d_1 \times d_1}$. Consequently, we may prove the latter result, and then obtain the former result
with this specialization. Recall that we let $\sigma_{\min}$ and $\sigma_{\max}$ denote (respectively) the minimum
and maximum eigenvalues of $X$, and that $\kappa_{\max} = \max_{j=1,\ldots,d_1}\|X_j\|_2$ denotes the maximum
$\ell_2$-norm over the columns. (In the special case $X = I_{d_1 \times d_1}$, we have $\sigma_{\min} = \sigma_{\max} = \kappa_{\max} = 1$.)
Both corollaries are based on the regularizer $\mathcal{R}(\cdot) = \|\cdot\|_1$, and the associated dual norm
$\mathcal{R}^*(\cdot) = \|\cdot\|_\infty$. We need to verify that the stated choices of $(\lambda_d, \mu_d)$ satisfy the
requirements (29) of Corollary 1. Given our assumptions on the pair $(X, W)$, a little calculation
shows that the matrix $Z := X^T W \in \mathbb{R}^{d_1 \times d_2}$ has independent columns, with each column
$Z_j \sim N\big(0, \nu^2 \frac{X^T X}{n}\big)$. Since $|||X^T X|||_{op} \le \sigma_{\max}^2$, known results on the singular values of
Gaussian random matrices [10] imply that, for a universal constant $c_1$,

$$\mathbb{P}\Big[\,|||X^T W|||_{op} \ge c_1\,\frac{\nu\,\sigma_{\max}(\sqrt{d_1} + \sqrt{d_2})}{\sqrt{n}}\Big] \le 2\exp(-c(d_1 + d_2)).$$

Consequently, setting $\lambda_d \ge 4c_1\,\frac{\nu\,\sigma_{\max}(\sqrt{d_1} + \sqrt{d_2})}{\sqrt{n}}$ ensures that the requirement (26) is satisfied. As
for the associated requirement for $\mu_d$, it suffices to upper bound the elementwise $\ell_\infty$ norm of
$X^T W$. Since the $\ell_2$ norms of the columns of $X$ are bounded by $\kappa_{\max}$, the entries of $X^T W$ are
Gaussian with variance at most $(\nu\kappa_{\max})^2/n$. Consequently, the standard Gaussian
tail bound combined with union bound yields

$$\mathbb{P}\Big[\|X^T W\|_\infty \ge 4\,\frac{\nu\kappa_{\max}}{\sqrt{n}}\sqrt{\log(d_1 d_2)}\Big] \le \exp(-\log(d_1 d_2)),$$

from which we conclude that the stated choices of $(\lambda_d, \mu_d)$ are valid with high probability.
Turning now to the RSC condition, we note that in the case of multivariate regression, we
have

$$\frac{1}{2}|||\mathfrak{X}(\Delta)|||_F^2 = \frac{1}{2}|||X\Delta|||_F^2 \ge \frac{\sigma_{\min}^2}{2}|||\Delta|||_F^2,$$

showing that the RSC condition holds with $\gamma = \sigma_{\min}^2$.

In order to obtain the sharper result for $X = I_{d_1 \times d_1}$ in Corollary 2, in which $\log(d_1 d_2)$ is
replaced by the smaller quantity $\log(\frac{d_1 d_2}{s})$, we need to be more careful in upper bounding
the noise term $\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle$. We refer the reader to Appendix C.1 for details of this argument.
5.3 Proof of Corollary 3

For this model, the noise matrix is recentered Wishart noise, namely $W = \frac{1}{n}\sum_{i=1}^{n} Z_i Z_i^T - \Sigma$,
where each $Z_i \sim N(0, \Sigma)$. Letting $U_i \sim N(0, I_{d \times d})$ be i.i.d. Gaussian random vectors, we
have

$$|||W|||_{op} = \Big|\Big|\Big|\sqrt{\Sigma}\Big(\frac{1}{n}\sum_{i=1}^{n} U_i U_i^T - I_{d \times d}\Big)\sqrt{\Sigma}\Big|\Big|\Big|_{op} \le |||\Sigma|||_{op}\,\Big|\Big|\Big|\frac{1}{n}\sum_{i=1}^{n} U_i U_i^T - I_{d \times d}\Big|\Big|\Big|_{op} \le 4|||\Sigma|||_{op}\sqrt{\frac{d}{n}},$$

where the final bound holds with probability greater than $1 - 2\exp(-c_1 d)$, using standard
tail bounds on Gaussian random matrices [10]. Thus, we see that the specified choice (36) of
$\lambda_d$ is valid for Theorem 1 with high probability.
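The operator-norm bound on the recentered sample covariance can be sanity-checked by simulation. The sketch below draws standard Gaussian vectors and compares the deviation $|||\frac{1}{n}\sum_i U_i U_i^T - I|||_{op}$ against $4\sqrt{d/n}$; the function name and the particular values of $(d, n)$ are illustrative assumptions, and a single random draw is of course not a proof of the high-probability statement.

```python
import numpy as np

def wishart_deviation(d, n, seed=0):
    """Operator norm of (1/n) * sum_i U_i U_i^T - I for U_i ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, d))      # rows are the vectors U_i
    S = U.T @ U / n - np.eye(d)          # recentered sample covariance
    return np.linalg.norm(S, 2)          # ord=2 gives the largest singular value

d, n = 20, 400
deviation = wishart_deviation(d, n)
threshold = 4 * np.sqrt(d / n)
```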

We now turn to the choice of $\mu_d$. The entries of $W$ are products of Gaussian variables,
and hence have sub-exponential tails (e.g., [3]). Therefore, for any entry $(i,j)$, we have the
tail bound $\mathbb{P}[|W_{ij}| > \rho(\Sigma)\,t] \le 2\exp(-nt^2/20)$, valid for all $t \in (0, 1]$. By union bound over
all $d^2$ entries, we conclude that

$$\mathbb{P}\Big[\|W\|_\infty \ge 8\rho(\Sigma)\sqrt{\frac{\log d}{n}}\Big] \le 2\exp(-c_2\log d),$$

which shows that the specified choice of $\mu_d$ is also valid with high probability.

5.4 Proof of Proposition 1

To begin, let us recall condition (52) on the regularization parameters, and that, for this
proof, the matrices $(\widehat{\Theta}, \widehat{\Gamma})$ denote any optimal solutions to the optimization problems (40)
and (41) defining the two-step procedure. We again define the error matrices $\Delta^\Theta = \widehat{\Theta} - \Theta^*$
and $\Delta^\Gamma = \widehat{\Gamma} - \Gamma^*$, the matrices $\Delta^\Gamma_{\mathbb{M}} := \Pi_{\mathbb{M}}(\Delta^\Gamma)$ and $\Delta^\Gamma_{\mathbb{M}^\perp} = \Pi_{\mathbb{M}^\perp}(\Delta^\Gamma)$, and the matrices $\Gamma^*_{\mathbb{M}}$
and $\Gamma^*_{\mathbb{M}^\perp}$ as previously defined in the proof of Theorem 1.

Our proof of Proposition 1 is based on two lemmas, of which the first provides control on
the error $\Delta^\Gamma$ in estimating the sparse component.

Lemma 3. Under the assumptions of Proposition 1, for any subset $S$ of matrix indices of
cardinality at most $s$, the sparse error in any solution of the convex program (40) satisfies
the bound

$$|||\Delta^\Gamma|||_F^2 \le c_1\mu_d^2\Big\{ s + \frac{1}{\mu_d}\sum_{(j,k) \notin S}|\Gamma^*_{jk}| \Big\}. \tag{59}$$

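Proof. Since $\widehat{\Gamma}$ and $\Gamma^*$ are optimal and feasible (respectively) for the convex program (40),
we have

$$\frac{1}{2}|||\widehat{\Gamma} - Y|||_F^2 + \mu_d\|\widehat{\Gamma}\|_1 \le \frac{1}{2}|||\Theta^* + W|||_F^2 + \mu_d\|\Gamma^*\|_1.$$

Re-writing this inequality in terms of the error $\Delta^\Gamma$ and re-arranging terms yields

$$\frac{1}{2}|||\Delta^\Gamma|||_F^2 \le |\langle\!\langle \Delta^\Gamma, W + \Theta^* \rangle\!\rangle| + \mu_d\|\Gamma^*\|_1 - \mu_d\|\Gamma^* + \Delta^\Gamma\|_1.$$

By decomposability of the $\ell_1$-norm, we obtain

$$\frac{1}{2}|||\Delta^\Gamma|||_F^2 \le |\langle\!\langle \Delta^\Gamma, W + \Theta^* \rangle\!\rangle| + \mu_d\big\{\|\Gamma^*_S\|_1 + \|\Gamma^*_{S^c}\|_1 - \|\Gamma^*_S + \Delta^\Gamma_S\|_1 - \|\Gamma^*_{S^c} + \Delta^\Gamma_{S^c}\|_1\big\}$$
$$\le |\langle\!\langle \Delta^\Gamma, W + \Theta^* \rangle\!\rangle| + \mu_d\big\{2\|\Gamma^*_{S^c}\|_1 + \|\Delta^\Gamma_S\|_1 - \|\Delta^\Gamma_{S^c}\|_1\big\},$$

where the second step is based on two applications of the triangle inequality. Now by applying
Hölder's inequality and the triangle inequality to the first term on the right-hand side, we
obtain

$$\frac{1}{2}|||\Delta^\Gamma|||_F^2 \le \|\Delta^\Gamma\|_1\big[\|W\|_\infty + \|\Theta^*\|_\infty\big] + \mu_d\big\{2\|\Gamma^*_{S^c}\|_1 + \|\Delta^\Gamma_S\|_1 - \|\Delta^\Gamma_{S^c}\|_1\big\}$$
$$= \|\Delta^\Gamma_S\|_1\big\{\|W\|_\infty + \|\Theta^*\|_\infty + \mu_d\big\} + \|\Delta^\Gamma_{S^c}\|_1\big\{\|W\|_\infty + \|\Theta^*\|_\infty - \mu_d\big\} + 2\mu_d\|\Gamma^*_{S^c}\|_1$$
$$\le 2\mu_d\|\Delta^\Gamma_S\|_1 + 2\mu_d\|\Gamma^*_{S^c}\|_1,$$

where the final inequality follows from our stated choice (42) of the regularization parameter
$\mu_d$. Since $\|\Delta^\Gamma_S\|_1 \le \sqrt{s}\,|||\Delta^\Gamma_S|||_F \le \sqrt{s}\,|||\Delta^\Gamma|||_F$, the claim (59) follows with some algebra. □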

Our second lemma provides a bound on the low-rank error $\Delta^\Theta$ in terms of the sparse
matrix error $\Delta^\Gamma$.

Lemma 4. If in addition to the conditions of Proposition 1, the sparse error matrix is bounded
as $|||\Delta^\Gamma|||_F \le \delta$, then the low-rank error matrix is bounded as

$$|||\Delta^\Theta|||_F^2 \le c_1\lambda_d^2\Big\{ r + \frac{1}{\lambda_d}\sum_{j=r+1}^{d}\sigma_j(\Theta^*) \Big\} + c_2\delta^2. \tag{60}$$

As the proof of this lemma is somewhat more involved, we defer it to Appendix D. Finally,
combining the low-rank bound (60) with the sparse bound (59) from Lemma 3 yields the
claim of Proposition 1.
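For concreteness, a two-step procedure of this kind admits closed-form steps. The sketch below is a hedged illustration rather than the authors' implementation: the first step solves the program $\min_\Gamma \frac{1}{2}|||\Gamma - Y|||_F^2 + \mu_d\|\Gamma\|_1$ that appears in the proof of Lemma 3 by elementwise soft-thresholding, and the second step is assumed to be the nuclear-norm regularized least-squares fit to the residual $Y - \widehat{\Gamma}$ (the exact form of program (41) is not restated in this section), solved by singular value thresholding. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(A, tau):
    """Elementwise soft-thresholding: minimizer of 0.5*||G - A||_F^2 + tau*||G||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def singular_value_threshold(A, tau):
    """Minimizer of 0.5*||T - A||_F^2 + tau*||T||_N: shrink the singular values by tau."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(sv - tau, 0.0)) @ Vt

def two_step_estimate(Y, lam, mu):
    """Step 1: sparse estimate from Y. Step 2 (assumed form): low-rank fit to the residual."""
    gamma_hat = soft_threshold(Y, mu)
    theta_hat = singular_value_threshold(Y - gamma_hat, lam)
    return theta_hat, gamma_hat
```

Both steps are proximal operators, so no iterative solver is needed in this identity-observation setting.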



5.5 Proof of Corollary 6

For this corollary, we have $\mathcal{R}(\cdot) = \|\cdot\|_{2,1}$ and $\mathcal{R}^*(\cdot) = \|\cdot\|_{2,\infty}$. In order to establish the claim,
we need to show that the conditions of Corollary 5 on the regularization pair $(\lambda_d, \mu_d)$ hold
with high probability. The setting of $\lambda_d$ is the same as Corollary 2, and is valid by our earlier
argument. Hence, in order to complete the proof, it remains to establish an upper bound on
$\|W\|_{2,\infty}$.

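Let $W_k$ be the $k$-th column of the matrix. Noting that the function $W_k \mapsto \|W_k\|_2$ is
Lipschitz, by concentration of measure for Gaussian Lipschitz functions [16], we have

$$\mathbb{P}\big[\|W_k\|_2 \ge \mathbb{E}\|W_k\|_2 + t\big] \le \exp\Big(-\frac{t^2 d_1 d_2}{2\nu^2}\Big) \quad\text{for all } t > 0.$$

Using the Gaussianity of $W_k$, we have $\mathbb{E}\|W_k\|_2 \le \frac{\nu}{\sqrt{d_1 d_2}}\sqrt{d_1} = \frac{\nu}{\sqrt{d_2}}$. Applying union bound
over all $d_2$ columns, we conclude that with probability greater than $1 - \exp\big(-\frac{t^2 d_1 d_2}{2\nu^2} + \log d_2\big)$,
we have $\max_k\|W_k\|_2 \le \frac{\nu}{\sqrt{d_2}} + t$. Setting $t = 4\nu\sqrt{\frac{\log d_2}{d_1 d_2}}$ yields

$$\mathbb{P}\Big[\|W\|_{2,\infty} \ge \frac{\nu}{\sqrt{d_2}} + 4\nu\sqrt{\frac{\log d_2}{d_1 d_2}}\Big] \le \exp(-3\log d_2),$$

from which the claim follows.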
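The column-norm bound can be illustrated with a quick simulation. The following sketch draws a noise matrix with i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries and compares the largest column $\ell_2$-norm with the threshold $\frac{\nu}{\sqrt{d_2}} + 4\nu\sqrt{\frac{\log d_2}{d_1 d_2}}$; the function names and dimensions are illustrative assumptions, and one draw only illustrates (not proves) the high-probability claim.

```python
import numpy as np

def max_column_norm(d1, d2, nu=1.0, seed=0):
    """Largest column l2-norm of W with i.i.d. N(0, nu^2/(d1*d2)) entries."""
    rng = np.random.default_rng(seed)
    W = (nu / np.sqrt(d1 * d2)) * rng.standard_normal((d1, d2))
    return np.linalg.norm(W, axis=0).max()

def column_norm_threshold(d1, d2, nu=1.0):
    """Threshold nu/sqrt(d2) + 4*nu*sqrt(log(d2)/(d1*d2))."""
    return nu / np.sqrt(d2) + 4 * nu * np.sqrt(np.log(d2) / (d1 * d2))

observed = max_column_norm(50, 50)
threshold = column_norm_threshold(50, 50)
```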

As before, a sharper bound (with $\log d_2$ replaced by $\log(d_2/s)$) can be obtained by a refined
argument; we refer the reader to Appendix C.2 for the details.
5.6 Proof of Corollary 7

For this model, the noise matrix takes the form $W := \frac{1}{n}\sum_{i=1}^{n} U_i U_i^T - \Theta^*$, where $U_i \sim N(0, \Theta^*)$.
Since $\Theta^*$ is positive semidefinite with rank at most $r$, we can write

$$W = Q\Big\{\frac{1}{n}\sum_{i=1}^{n} Z_i Z_i^T - I_{r \times r}\Big\}Q^T,$$

where the matrix $Q \in \mathbb{R}^{d \times r}$ satisfies the relationship $\Theta^* = QQ^T$, and $Z_i \sim N(0, I_{r \times r})$ is
standard Gaussian in dimension $r$. Consequently, by known results on singular values of
Wishart matrices [10], we have $|||W|||_{op} \le c\,|||\Theta^*|||_{op}\sqrt{\frac{r}{n}}$ with high probability for a numerical
constant $c$, showing that the specified choice of $\lambda_d$ is valid. It remains to bound the quantity $\|W\|_{2,\infty}$. By known
matrix norm bounds [13], we have $\|W\|_{2,\infty} \le |||W|||_{op}$, so that the claim follows by the previous
argument.

5.7 Proof of Theorem 2

Our lower bound proofs are based on a standard reduction [T^l Ell ED] from estimation to a 
multiway hypothesis testing problem over a packing set of matrix pairs. In particular, given a 
collection $\{(\Theta^j, \Gamma^j),\ j = 1, 2, \ldots, M\}$ of matrix pairs contained in some family $\mathcal{F}$, we say that
it forms a $\delta$-packing in Frobenius norm if, for all distinct pairs $i, j \in \{1, 2, \ldots, M\}$, we have

$$|||\Theta^i - \Theta^j|||_F^2 + |||\Gamma^i - \Gamma^j|||_F^2 \ge \delta^2.$$

Given such a packing set, it is a straightforward consequence of Fano's inequality that the
minimax error over $\mathcal{F}$ satisfies the lower bound

$$\mathbb{P}\Big[\mathfrak{M}(\mathcal{F}) \ge \frac{\delta^2}{8}\Big] \ge 1 - \frac{I(Y; J) + \log 2}{\log M}, \tag{61}$$

where $I(Y; J)$ is the mutual information between the observation matrix $Y \in \mathbb{R}^{d_1 \times d_2}$ and $J$,
an index uniformly distributed over $\{1, 2, \ldots, M\}$. In order to obtain different components of
our bound, we make different choices of the packing set, and use different bounding techniques
for the mutual information.
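The right-hand side of a Fano-type bound is simple to evaluate once the mutual information and the packing size are known. The helper below (a hypothetical name, given only as a sketch) computes $1 - \frac{I(Y;J) + \log 2}{\log M}$; for instance, a packing with $M = 4$ elements whose observations carry no information about the index ($I(Y;J) = 0$) gives probability $1/2$.

```python
import math

def fano_probability_bound(mutual_information, num_hypotheses):
    """Lower bound 1 - (I(Y;J) + log 2) / log M from Fano's inequality (natural logs)."""
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    return 1.0 - (mutual_information + math.log(2)) / math.log(num_hypotheses)

# Example: four hypotheses, zero mutual information.
example = fano_probability_bound(0.0, 4)
```

The bound is only informative when the packing is large relative to the mutual information, which is why the later constructions aim for exponentially many well-separated pairs.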

5.7.1 Lower bounds for elementwise sparsity 

We begin by proving the lower bound (50) for matrix decompositions over the family $\mathcal{F}_{sp}(r, s, \alpha)$.

Packing for radius of non-identifiability: Let us first establish the lower bound involving
the radius of non-identifiability, namely the term scaling as $\frac{\alpha^2 s}{d_1 d_2}$ in the case of $s$-sparsity for $\Gamma^*$.
Recall from Example 6 the "bad" matrix (33), which we denote here by $B^*$. By construction,
we have $|||B^*|||_F^2 = \frac{\alpha^2 s}{d_1 d_2}$. Using this matrix, we construct a very simple packing set with $M = 4$
matrix pairs $(\Theta, \Gamma)$:

$$\Big\{ (B^*, -B^*),\ (-B^*, B^*),\ \big(\tfrac{1}{\sqrt{2}}B^*, -\tfrac{1}{\sqrt{2}}B^*\big),\ (0, 0) \Big\} \tag{62}$$

Each one of these matrix pairs $(\Theta, \Gamma)$ belongs to the set $\mathcal{F}_{sp}(1, s, \alpha)$, so it can be used to
establish a lower bound over this set. (Moreover, it also yields a lower bound over the sets
$\mathcal{F}_{sp}(r, s, \alpha)$ for $r > 1$, since they are supersets.) It can also be verified that any two
distinct pairs of matrices in the set (62) differ in squared Frobenius norm by at least
$\delta^2 = c\,|||B^*|||_F^2 = c\,\frac{\alpha^2 s}{d_1 d_2}$ for a numerical constant $c > 0$. Let $J$ be a random index uniformly
distributed over the four possible models in our packing set (62). By construction, for any
matrix pair $(\Theta, \Gamma)$ in the packing set, we have $\Theta + \Gamma = 0$. Consequently, for any one of these
models, the observation matrix $Y$ is simply equal to pure noise $W$, and hence $I(Y; J) = 0$.
Putting together the pieces, the Fano bound (61) implies that

$$\mathbb{P}\Big[\mathfrak{M}(\mathcal{F}_{sp}(1, s, \alpha)) \ge c_0\frac{\alpha^2 s}{d_1 d_2}\Big] \ge 1 - \frac{\log 2}{\log 4} = \frac{1}{2}.$$

Packing for estimation error: We now describe the construction of a packing set for
lower bounding the estimation error. In this case, our construction is more subtle, based on
the Cartesian product of two components, one for the low rank matrices, and the other
for the sparse matrices. For the low rank component, we re-state a slightly modified form
(adapted to the setting of non-square matrices) of Lemma 2 from the paper [20]:

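Lemma 5. For $d_1, d_2 \ge 10$, a tolerance $\delta > 0$, and for each $r = 1, 2, \ldots, d$, there exists a set
of $d_1 \times d_2$-dimensional matrices $\{\Theta^1, \ldots, \Theta^M\}$ with cardinality $M \ge \frac{1}{4}\exp\big(\frac{r d_1}{256} + \frac{r d_2}{256}\big)$ such
that each matrix has rank $r$, and moreover

$$|||\Theta^\ell|||_F^2 = \delta^2 \quad\text{for all } \ell = 1, 2, \ldots, M, \tag{63a}$$
$$|||\Theta^\ell - \Theta^k|||_F^2 \ge \delta^2 \quad\text{for all } \ell \neq k, \tag{63b}$$
$$\|\Theta^\ell\|_\infty \le \delta\sqrt{\frac{32\log(d_1 d_2)}{d_1 d_2}} \quad\text{for all } \ell = 1, 2, \ldots, M. \tag{63c}$$

Consequently, as long as $\delta \le 1$, we are guaranteed that the matrices $\Theta^\ell$ belong to the set
$\mathcal{F}_{sp}(r, s, \alpha)$ for all $\alpha \ge 32\sqrt{\log(d_1 d_2)}$.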

As for the sparse matrices, the following result is a modification, so as to apply to the
matrix setting of interest here, of Lemma 5 from the paper [23]:

Lemma 6 (Sparse matrix packing). For any $\delta > 0$, and for each integer $s < d_1 d_2$, there
exists a set of matrices $\{\Gamma^1, \ldots, \Gamma^N\}$ with cardinality $N \ge \exp\big(\frac{s}{2}\log\frac{d_1 d_2 - s}{s/2}\big)$ such that

$$|||\Gamma^j - \Gamma^k|||_F^2 \ge \delta^2 \quad\text{for all } j \neq k, \text{ and} \tag{64a}$$
$$|||\Gamma^j|||_F^2 \le 8\delta^2, \tag{64b}$$

and such that each $\Gamma^j$ has at most $s$ non-zero entries.

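We now have the necessary ingredients to prove the lower bound (50). By combining
Lemmas 5 and 6, we conclude that there exists a set of matrices with cardinality

$$MN \ge \frac{1}{4}\exp\Big\{\frac{s}{2}\log\frac{d_1 d_2 - s}{s/2} + \frac{r d_1}{256} + \frac{r d_2}{256}\Big\} \tag{65}$$

such that

$$|||(\Theta^\ell, \Gamma^k) - (\Theta^{\ell'}, \Gamma^{k'})|||_F^2 \ge \delta^2 \quad\text{for all pairs such that } \ell \neq \ell' \text{ or } k \neq k', \text{ and} \tag{66a}$$
$$|||(\Theta^\ell, \Gamma^k)|||_F^2 \le 9\delta^2 \quad\text{for all } (\ell, k). \tag{66b}$$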
Let $\mathbb{P}^{\ell,k}$ denote the distribution of the observation matrix $Y$ when $\Theta^\ell$ and $\Gamma^k$ are the
underlying parameters. We apply the Fano construction over the class of $MN$ such distributions,
thereby obtaining that in order to show that the minimax error is lower bounded by
$c_0\delta^2$ (for some universal constant $c_0 > 0$), it suffices to show that

$$\frac{\frac{1}{\binom{MN}{2}}\sum_{(\ell,k) \neq (\ell',k')} D(\mathbb{P}^{\ell,k}\,\|\,\mathbb{P}^{\ell',k'}) + \log 2}{\log(MN)} \le \frac{1}{2}, \tag{67}$$

where $D(\mathbb{P}^{\ell,k}\,\|\,\mathbb{P}^{\ell',k'})$ denotes the Kullback-Leibler divergence between the distributions $\mathbb{P}^{\ell,k}$
and $\mathbb{P}^{\ell',k'}$. Given the assumption of Gaussian noise with variance $\nu^2/(d_1 d_2)$, we have

$$D(\mathbb{P}^{\ell,k}\,\|\,\mathbb{P}^{\ell',k'}) = \frac{d_1 d_2}{2\nu^2}|||(\Theta^\ell, \Gamma^k) - (\Theta^{\ell'}, \Gamma^{k'})|||_F^2 \overset{(i)}{\le} \frac{18\,d_1 d_2\,\delta^2}{\nu^2}, \tag{68}$$

where the bound (i) follows from the condition (66b). Combined with lower bound (65), we
see that it suffices to choose $\delta$ such that

$$\frac{\frac{18\,d_1 d_2\,\delta^2}{\nu^2} + \log 2}{\log\frac{1}{4} + \frac{s}{2}\log\frac{d_1 d_2 - s}{s/2} + \frac{r d_1}{256} + \frac{r d_2}{256}} \le \frac{1}{2}.$$

For $d_1, d_2$ larger than a finite constant (to exclude degenerate cases), we see that the choice

$$\delta^2 = c_0\nu^2\Big\{\frac{r}{d_1} + \frac{r}{d_2} + \frac{s\log\frac{d_1 d_2 - s}{s/2}}{d_1 d_2}\Big\}$$

for a suitably small constant $c_0 > 0$ is sufficient, thereby establishing the lower bound (50).



5.7.2 Lower bounds for columnwise sparsity 

The lower bound (51) for columnwise sparsity follows from a similar argument. The only modifications
are in the packing sets.

Packing for radius of non-identifiability: In order to establish a lower bound of order $\frac{\alpha^2 s}{d_2}$,
recall the "bad" matrix (45) from Example 7, which we denote by $B^*$. By construction,
it has squared Frobenius norm $|||B^*|||_F^2 = \frac{\alpha^2 s}{d_2}$. We use it to form the packing set

$$\Big\{ (B^*, -B^*),\ (-B^*, B^*),\ \big(\tfrac{1}{\sqrt{2}}B^*, -\tfrac{1}{\sqrt{2}}B^*\big),\ (0, 0) \Big\} \tag{69}$$

Each one of these matrix pairs $(\Theta, \Gamma)$ belongs to the set $\mathcal{F}_{col}(1, s, \alpha)$, so it can be used to
establish a lower bound over this set. (Moreover, it also yields a lower bound over the sets
$\mathcal{F}_{col}(r, s, \alpha)$ for $r > 1$, since they are supersets.) It can also be verified that any two
distinct pairs of matrices in the set (69) differ in squared Frobenius norm by at least
$\delta^2 = c\,|||B^*|||_F^2 = c\,\frac{\alpha^2 s}{d_2}$ for a numerical constant $c > 0$. Consequently, the same argument as
before shows that

$$\mathbb{P}\Big[\mathfrak{M}(\mathcal{F}_{col}(1, s, \alpha)) \ge c_0\frac{\alpha^2 s}{d_2}\Big] \ge 1 - \frac{\log 2}{\log 4} = \frac{1}{2}.$$
Packing for estimation error: We now describe packings for the estimation error terms.
For the low-rank packing set, we need to ensure that the $(2, \infty)$-norm is controlled. From the
bound (63c), we have the guarantee

$$\|\Theta^\ell\|_{2,\infty} \le \delta\sqrt{\frac{32\log(d_1 d_2)}{d_2}} \quad\text{for all } \ell = 1, 2, \ldots, M, \tag{70}$$

so that, as long as $\delta \le 1$, the matrices $\Theta^\ell$ belong to the set $\mathcal{F}_{col}(r, s, \alpha)$ for all $\alpha \ge 32\sqrt{\log(d_1 d_2)}$.

The following lemma characterizes a suitable packing set for the columnwise sparse component:

Lemma 7 (Columnwise sparse matrix packing). For all $d_2 \ge 10$ and integers $s$ in the set
$\{1, 2, \ldots, d_2 - 1\}$, there exists a family of $d_1 \times d_2$ matrices $\{\Gamma^k,\ k = 1, 2, \ldots, N\}$ with cardinality



. fS d2-s sdi. 



satisfying the inequalities

$$|||\Gamma^j - \Gamma^k|||_F^2 \ge \delta^2 \quad\text{for all } j \neq k, \text{ and} \tag{71a}$$
$$|||\Gamma^j|||_F^2 \le 64\delta^2, \tag{71b}$$

and such that each $\Gamma^j$ has at most $s$ non-zero columns.

This claim follows by suitably adapting Lemma 5(b) in the paper by Raskutti et al. [24] on
minimax rates for kernel classes. In particular, we view column $j$ of a matrix $\Gamma$ as defining
a linear function in dimension $\mathbb{R}^{d_1}$; for each $j = 1, 2, \ldots, d_2$, this defines a Hilbert space $\mathcal{H}_j$
of functions. By known results on metric entropy of Euclidean balls [17], this function class
has logarithmic metric entropy, so that part (b) of the above lemma applies, and yields the
stated result.



Using this lemma and the packing set for the low-rank component and following through
the Fano construction yields the claimed lower bound (51) on the minimax error for the class
$\mathcal{F}_{col}(r, s, \alpha)$, which completes the proof of Theorem 2.



6 Discussion 



In this paper, we analyzed a class of convex relaxations for solving a general class of matrix
decomposition problems, in which the goal is to recover a pair of matrices, based on observing
a noise-contaminated version of their sum. Since the problem is ill-posed in general, it is
essential to impose structure, and this paper focuses on the setting in which one matrix
is approximately low-rank, and the second has a complementary form of low-dimensional
structure enforced by a decomposable regularizer. Particular cases include matrices that are
elementwise sparse, or columnwise sparse, and the associated matrix decomposition problems
have various applications, including robust PCA, robustness in collaborative filtering, and
model selection in Gauss-Markov random fields. We provided a general non-asymptotic bound
on the Frobenius error of a convex relaxation based on regularizing the least-squares loss
with a combination of the nuclear norm with a decomposable regularizer. When specialized
to the case of elementwise and columnwise sparsity, these estimators yield rates that are
minimax-optimal up to constant factors.

Various extensions of this work are possible. We have not discussed here how our estimator 
would behave under a partial observation model, in which only a fraction of the entries are 
observed. This problem is very closely related to matrix completion, a problem for which 
recent work by Negahban and Wainwright [20] shows that a form of restricted strong convexity 
holds with high probability. This property could be adapted to the current setting, and would 
allow for proving Frobenius norm error bounds on the low rank component. Finally, although 
this paper has focused on the case in which the first matrix component is approximately low 
rank, much of our theory could be applied to a more general class of matrix decomposition 
problems, in which the first component is penalized by a decomposable regularizer that is 
"complementary" to the second matrix component. It remains to explore the properties and 
applications of these different forms of matrix decomposition. 
A Proof of Lemma 1

The decomposition described in part (a) was established by Recht et al. [25], so that it remains
to prove part (b). With the appropriate definitions, part (b) can be recovered by exploiting
Lemma 1 from Negahban et al. [19]. Their lemma applies to optimization problems of the
general form

$$\min_{\theta}\big\{\mathcal{L}(\theta) + \gamma_n\,r(\theta)\big\},$$

where $\mathcal{L}$ is a loss function on the parameter space, and $r$ is a norm-based regularizer that satisfies
a property known as decomposability. The elementwise $\ell_1$-norm as well as the nuclear norm
are both instances of decomposable regularizers. Their lemma requires that the regularization
parameter $\gamma_n$ be chosen such that $\gamma_n \ge 2\,r^*(\nabla\mathcal{L}(\theta^*))$, where $r^*$ is the dual norm, and $\nabla\mathcal{L}(\theta^*)$
is the gradient of the loss evaluated at the true parameter.

We now discuss how this lemma can be applied in our special case. Here the relevant
parameters are of the form $\theta = (\Theta, \Gamma)$, and the loss function is given by

$$\mathcal{L}(\Theta, \Gamma) = \frac{1}{2}|||Y - (\Theta + \Gamma)|||_F^2.$$

The sample size $n = d^2$, since we make one observation for each entry of the matrix. On the
other hand, the regularizer is given by the function

$$r(\theta) = \mathcal{Q}(\Theta, \Gamma) := |||\Theta|||_N + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma),$$

coupled with the regularization parameter $\gamma_n = \lambda_d$. By assumption, the regularizer $\mathcal{R}$ is
decomposable, and as shown in the paper [19], the nuclear norm is also decomposable. Since
$\mathcal{Q}$ is simply a sum of these decomposable regularizers over separate matrices, it is also
decomposable.

It remains to compute the gradient $\nabla\mathcal{L}(\Theta^*, \Gamma^*)$, and evaluate the dual norm. A straightforward
calculation yields that $\nabla\mathcal{L}(\Theta^*, \Gamma^*) = [W\ \ W]^T$. In addition, it can be verified by
standard properties of dual norms that

$$\mathcal{Q}^*(U, V) = |||U|||_{op} + \frac{\lambda_d}{\mu_d}\mathcal{R}^*(V).$$

Thus, it suffices to choose the regularization parameter such that

$$\lambda_d \ge 2\,\mathcal{Q}^*(W, W) = 2|||W|||_{op} + \frac{2\lambda_d}{\mu_d}\mathcal{R}^*(W).$$

Given our condition (52), we have

$$2|||W|||_{op} + \frac{2\lambda_d}{\mu_d}\mathcal{R}^*(W) \le 2|||W|||_{op} + \frac{\lambda_d}{2},$$

meaning that it suffices to have $\lambda_d \ge 4|||W|||_{op}$, as stated in the second part of condition (52).

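B Proof of Lemma 2

By the RSC condition (22), we have

$$\frac{1}{2}|||\mathfrak{X}(\Delta^\Theta + \Delta^\Gamma)|||_F^2 - \frac{\gamma}{2}|||\Delta^\Theta + \Delta^\Gamma|||_F^2 \ge -\tau_n\Phi^2(\Delta^\Theta + \Delta^\Gamma) \ge -\tau_n\mathcal{Q}^2(\Delta^\Theta, \Delta^\Gamma), \tag{72}$$

where the second inequality follows by the definitions (20) and (21) of $\mathcal{Q}$ and $\Phi$ respectively.
We now derive a lower bound on $|||\Delta^\Theta + \Delta^\Gamma|||_F$, and an upper bound on $\mathcal{Q}^2(\Delta^\Theta, \Delta^\Gamma)$. Beginning
with the former term, observe that

$$\frac{\gamma}{2}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) - \frac{\gamma}{2}|||\Delta^\Theta + \Delta^\Gamma|||_F^2 = -\gamma\langle\!\langle \Delta^\Theta, \Delta^\Gamma \rangle\!\rangle,$$

so that it suffices to upper bound $\gamma|\langle\!\langle \Delta^\Theta, \Delta^\Gamma \rangle\!\rangle|$. By the duality of the pair $(\mathcal{R}, \mathcal{R}^*)$, we have

$$\gamma|\langle\!\langle \Delta^\Theta, \Delta^\Gamma \rangle\!\rangle| \le \gamma\,\mathcal{R}^*(\Delta^\Theta)\,\mathcal{R}(\Delta^\Gamma).$$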

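Now since $\widehat{\Theta}$ and $\Theta^*$ are both feasible for the program (7) and recalling that $\Delta^\Theta = \widehat{\Theta} - \Theta^*$,
an application of triangle inequality yields

$$\gamma\,\mathcal{R}^*(\Delta^\Theta) \le \gamma\big\{\mathcal{R}^*(\widehat{\Theta}) + \mathcal{R}^*(\Theta^*)\big\} \le \frac{2\alpha\gamma}{\kappa_d} \overset{(i)}{\le} \frac{\mu_d}{2},$$

where inequality (i) follows from our choice of $\mu_d$. Putting together the pieces, we have shown
that

$$\frac{\gamma}{2}|||\Delta^\Theta + \Delta^\Gamma|||_F^2 \ge \frac{\gamma}{2}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) - \frac{\mu_d}{2}\mathcal{R}(\Delta^\Gamma).$$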


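Since the quantity $\lambda_d|||\Delta^\Theta|||_N \ge 0$, we can write

$$\frac{\gamma}{2}|||\Delta^\Theta + \Delta^\Gamma|||_F^2 \ge \frac{\gamma}{2}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) - \frac{\mu_d}{2}\mathcal{R}(\Delta^\Gamma) - \frac{\lambda_d}{2}|||\Delta^\Theta|||_N$$
$$= \frac{\gamma}{2}\big(|||\Delta^\Theta|||_F^2 + |||\Delta^\Gamma|||_F^2\big) - \frac{\lambda_d}{2}\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma),$$

where the latter equality follows by the definition (20) of $\mathcal{Q}$.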
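The first part of this argument rests on two elementary facts: the expansion of $|||A + B|||_F^2$, which produces the cross term $-\gamma\langle\!\langle A, B \rangle\!\rangle$, and Hölder's inequality for a dual pair of norms. The sketch below checks both numerically on random matrices; the function names are hypothetical, and the dual pair $(\ell_1, \ell_\infty)$ is used as one assumed instance of $(\mathcal{R}, \mathcal{R}^*)$, matching the elementwise-sparse case.

```python
import numpy as np

def cross_term_gap(A, B, gamma):
    """(gamma/2)(||A||_F^2 + ||B||_F^2) - (gamma/2)||A + B||_F^2."""
    fro2 = lambda M: np.linalg.norm(M, "fro") ** 2
    return 0.5 * gamma * (fro2(A) + fro2(B)) - 0.5 * gamma * fro2(A + B)

def trace_inner(A, B):
    """Trace inner product <<A, B>> = sum_ij A_ij * B_ij."""
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 4))
gap = cross_term_gap(A, B, gamma=0.7)
# Hoelder bound with R = l1 (applied to B) and R* = l_infinity (applied to A).
holder_bound = np.abs(A).max() * np.abs(B).sum()
```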

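Next we turn to the upper bound on $\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma)$. By the triangle inequality, we have

$$\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) \le \mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + \mathcal{Q}(\Delta^\Theta_B, \Delta^\Gamma_{\mathbb{M}^\perp}).$$

Furthermore, substituting the bound (54) into the above inequality yields

$$\mathcal{Q}(\Delta^\Theta, \Delta^\Gamma) \le 4\,\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) + 4\Big\{\sum_{j=r+1}^{d}\sigma_j(\Theta^*) + \frac{\mu_d}{\lambda_d}\mathcal{R}(\Gamma^*_{\mathbb{M}^\perp})\Big\}. \tag{73}$$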

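Since $\Delta^\Theta_A$ has rank at most $2r$ and $\Delta^\Gamma_{\mathbb{M}}$ belongs to the model space $\mathbb{M}$, we have

$$\lambda_d\mathcal{Q}(\Delta^\Theta_A, \Delta^\Gamma_{\mathbb{M}}) \le \sqrt{2r}\,\lambda_d|||\Delta^\Theta_A|||_F + \Psi(\mathbb{M})\,\mu_d|||\Delta^\Gamma_{\mathbb{M}}|||_F \le \sqrt{2r}\,\lambda_d|||\Delta^\Theta|||_F + \Psi(\mathbb{M})\,\mu_d|||\Delta^\Gamma|||_F.$$

The claim then follows by substituting the above equation into equation (73), and then
substituting the result into the earlier inequality (72).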

C Refinement of achievability results 

In this appendix, we provide refined arguments that yield sharpened forms of Corollaries 2
and 6. These refinements yield achievable bounds that match the minimax lower bounds in
Theorem 2 up to constant factors. We note that these refinements are significantly different
only when the sparsity index $s$ scales as $\Theta(d_1 d_2)$ for Corollary 2, or as $\Theta(d_2)$ for Corollary 6.

C.1 Refinement of Corollary 2

In the proof of Theorem 1, when specialized to the $\ell_1$-norm, the noise term $|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle|$ is
simply upper bounded by $\|W\|_\infty\|\Delta^\Gamma\|_1$. Here we use a more careful argument to control this
noise term. Throughout the proof, we assume that the regularization parameter $\lambda_d$ is set in
the usual way, whereas we choose



We split our analysis into two cases. 

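Case 1: First, suppose that $\|\Delta^\Gamma\|_1 \le \sqrt{s}\,|||\Delta^\Gamma|||_F$. In this case, we have the upper bound

$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \le \sup_{\substack{\|\Delta\|_1 \le \sqrt{s}\,|||\Delta^\Gamma|||_F \\ |||\Delta|||_F \le |||\Delta^\Gamma|||_F}} |\langle\!\langle W, \Delta \rangle\!\rangle| = |||\Delta^\Gamma|||_F \underbrace{\sup_{\substack{\|\Delta\|_1 \le \sqrt{s} \\ |||\Delta|||_F \le 1}} |\langle\!\langle W, \Delta \rangle\!\rangle|}_{Z(s)}.$$

It remains to upper bound the random variable $Z(s)$. Viewed as a function of $W$, it is a
Lipschitz function with parameter $\frac{\nu}{\sqrt{d_1 d_2}}$, so that

$$\mathbb{P}\big[Z(s) \ge \mathbb{E}[Z(s)] + \delta\big] \le \exp\Big(-\frac{d_1 d_2\,\delta^2}{2\nu^2}\Big).$$

Setting $\delta^2 = \frac{4s\nu^2}{d_1 d_2}\log\big(\frac{d_1 d_2}{s}\big)$, we have

$$Z(s) \le \mathbb{E}[Z(s)] + 2\nu\sqrt{\frac{s}{d_1 d_2}\log\Big(\frac{d_1 d_2}{s}\Big)}$$

with probability greater than $1 - \exp\big(-2s\log(\frac{d_1 d_2}{s})\big)$.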

It remains to upper bound the expected value. In order to do so, we apply Theorem 5.1(ii)
from Gordon et al. [11] with $(q_0, q_1) = (1, 2)$, $n = d_1 d_2$ and $t = \sqrt{s}$, thereby obtaining

$$\mathbb{E}[Z(s)] \le c\,\frac{\nu}{\sqrt{d_1 d_2}}\sqrt{s}\sqrt{2 + \log\Big(\frac{2 d_1 d_2}{s}\Big)} \le c'\,\frac{\nu}{\sqrt{d_1 d_2}}\sqrt{s\log\Big(\frac{d_1 d_2}{s}\Big)}.$$

With this bound, proceeding through the remainder of the proof yields the claimed rate.



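Case 2: Alternatively, we must have $\|\Delta^\Gamma\|_1 > \sqrt{s}\, |||\Delta^\Gamma|||_F$. In this case, we need to show that
the stated choice (71) of $\mu_d$ satisfies $\mu_d \|\Delta^\Gamma\|_1 \ge 2 |\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle|$ with high probability. As can
be seen from examining the proofs, this condition is sufficient to ensure that Lemma 1 and
Lemma 2 both hold, as required for our analysis.
We have the upper bound

$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \;\le\; \sup_{\substack{\|\Delta\|_1 \le \|\Delta^\Gamma\|_1 \\ |||\Delta|||_F \le |||\Delta^\Gamma|||_F}} |\langle\!\langle W, \Delta \rangle\!\rangle| \;=\; |||\Delta^\Gamma|||_F \; Z\Big(\frac{\|\Delta^\Gamma\|_1}{|||\Delta^\Gamma|||_F}\Big),$$

where for any radius $t > 0$, we define the random variable

$$Z(t) := \sup_{\substack{\|\Delta\|_1 \le t \\ |||\Delta|||_F \le 1}} |\langle\!\langle W, \Delta \rangle\!\rangle|.$$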

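For each fixed $t$, the same argument as before shows that $Z(t)$ is concentrated around its
expectation, and Theorem 5.1(ii) from Gordon et al. [11] with $(q_0, q_1) = (1, 2)$ and $n = d_1 d_2$
yields

$$E[Z(t)] \;\le\; c\, \frac{\nu}{\sqrt{d_1 d_2}}\; t\, \sqrt{2 + \log\Big(\frac{2 d_1 d_2}{t^2}\Big)}.$$

Choosing $\delta$ appropriately in the concentration bound, we conclude that

$$Z(t) \;\le\; c'\, t\, \frac{\nu}{\sqrt{d_1 d_2}}\, \sqrt{2 + \log\Big(\frac{2 d_1 d_2}{t^2}\Big)}$$

with high probability. A standard peeling argument (e.g., [28]) can be used to extend this
bound to a uniform one over the choice of radii $t$, so that it applies to the random one
$t = \|\Delta^\Gamma\|_1 / |||\Delta^\Gamma|||_F$ of interest. (The only changes in doing such a peeling are in constant terms.) We
thus conclude that

$$Z\Big(\frac{\|\Delta^\Gamma\|_1}{|||\Delta^\Gamma|||_F}\Big) \;\le\; c'\, \frac{\|\Delta^\Gamma\|_1}{|||\Delta^\Gamma|||_F}\, \frac{\nu}{\sqrt{d_1 d_2}}\, \sqrt{2 + \log\Big(\frac{2 d_1 d_2}{\|\Delta^\Gamma\|_1^2 / |||\Delta^\Gamma|||_F^2}\Big)}$$

with high probability. Since $\|\Delta^\Gamma\|_1 > \sqrt{s}\, |||\Delta^\Gamma|||_F$, we have
$\frac{1}{\|\Delta^\Gamma\|_1^2 / |||\Delta^\Gamma|||_F^2} < \frac{1}{s}$, and hence

$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \;\le\; |||\Delta^\Gamma|||_F \; Z\Big(\frac{\|\Delta^\Gamma\|_1}{|||\Delta^\Gamma|||_F}\Big) \;\le\; c''\, \|\Delta^\Gamma\|_1\, \frac{\nu}{\sqrt{d_1 d_2}}\, \sqrt{\log\Big(\frac{d_1 d_2}{s}\Big)}$$

with high probability. With this bound, the remainder of the proof proceeds as before. In
particular, the refined choice (71) of $\mu_d$ is adequate.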

C.2 Refinement of Corollary 6

As in the refinement of Corollary 2 from Appendix C.1, we need to be more careful in
controlling the noise term $\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle$. For this corollary, we make the refined choice of regularizer



$$\mu_d = [\,\ldots\,]\, \sqrt{\frac{\log(d_2/s)}{d_1 d_2}} + 4\alpha\, [\,\ldots\,]. \qquad (75)$$



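As in Appendix C.1, we split our analysis into two cases.

Case 1: First, suppose that $\|\Delta^\Gamma\|_{2,1} \le \sqrt{s}\, |||\Delta^\Gamma|||_F$. In this case, we have

$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \;\le\; \sup_{\substack{\|\Delta\|_{2,1} \le \sqrt{s}\, |||\Delta^\Gamma|||_F \\ |||\Delta|||_F \le |||\Delta^\Gamma|||_F}} |\langle\!\langle W, \Delta \rangle\!\rangle| \;=\; |||\Delta^\Gamma|||_F \underbrace{\sup_{\substack{\|\Delta\|_{2,1} \le \sqrt{s} \\ |||\Delta|||_F \le 1}} |\langle\!\langle W, \Delta \rangle\!\rangle|}_{Z(s)}.$$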



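The function $W \mapsto Z(s)$ is a Lipschitz function with parameter $\nu/\sqrt{d_1 d_2}$, so that by
concentration of measure for Gaussian Lipschitz functions [16], it satisfies the upper tail bound
$P\big[Z(s) \ge E[Z(s)] + \delta\big] \le \exp\big(-\frac{d_1 d_2\, \delta^2}{2\nu^2}\big)$. Setting $\delta^2 = \frac{4 s \nu^2}{d_1 d_2} \log\big(\frac{d_2}{s}\big)$ yields

$$Z(s) \;\le\; E[Z(s)] + 2\nu \sqrt{\frac{s \log(d_2/s)}{d_1 d_2}} \qquad (76)$$

with probability greater than $1 - \exp\big(-2 s \log(\frac{d_2}{s})\big)$.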

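It remains to upper bound the expectation. Applying the Cauchy-Schwarz inequality to
each column, we have

$$\begin{aligned}
E[Z(s)] \;&\le\; E\Big[\sup_{\substack{\|\Delta\|_{2,1} \le \sqrt{s} \\ |||\Delta|||_F \le 1}} \sum_{k=1}^{d_2} \|W_k\|_2\, \|\Delta_k\|_2\Big] \\
&\le\; E\Big[\sup_{\substack{\|\Delta\|_{2,1} \le \sqrt{s} \\ |||\Delta|||_F \le 1}} \sum_{k=1}^{d_2} \big(\|W_k\|_2 - E[\|W_k\|_2]\big)\, \|\Delta_k\|_2\Big] + \sup_{\|\Delta\|_{2,1} \le \sqrt{s}} \Big(\sum_{k=1}^{d_2} \|\Delta_k\|_2\Big)\, E[\|W_1\|_2] \\
&\le\; E\Big[\sup_{\substack{\|\Delta\|_{2,1} \le \sqrt{s} \\ |||\Delta|||_F \le 1}} \sum_{k=1}^{d_2} \underbrace{\big(\|W_k\|_2 - E[\|W_k\|_2]\big)}_{V_k}\, \|\Delta_k\|_2\Big] + \nu \sqrt{\frac{s}{d_2}},
\end{aligned}$$

using the fact that $E[\|W_1\|_2] \le \nu/\sqrt{d_2}$.

Now the variable $V_k$ is zero-mean, and sub-Gaussian with parameter $\nu/\sqrt{d_1 d_2}$, again using
concentration of measure for Lipschitz functions of Gaussians [16]. Consequently, by setting
$\delta_k = \|\Delta_k\|_2$, we can write

$$E[Z(s)] \;\le\; E\Big[\sup_{\substack{\|\delta\|_1 \le 4\sqrt{s} \\ \|\delta\|_2 \le 1}} \sum_{k=1}^{d_2} V_k\, \delta_k\Big] + \nu \sqrt{\frac{s}{d_2}}.$$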



Applying Theorem 5.1(ii) from Gordon et al. [11] with $(q_0, q_1) = (1, 2)$, $n = d_2$ and $t = 4\sqrt{s}$
then yields an upper bound on this expectation which, combined with the concentration bound (76), yields the refined claim.
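The two facts about column norms used in Case 1, namely $E[\|W_1\|_2] \le \nu/\sqrt{d_2}$ and the concentration of $V_k = \|W_k\|_2 - E[\|W_k\|_2]$ at scale $\nu/\sqrt{d_1 d_2}$, can be examined by simulation. The sketch below is illustrative only: it assumes i.i.d. $N(0, \nu^2/(d_1 d_2))$ entries, the function name `column_norms` and the sizes are hypothetical, and the empirical variance is used as a rough proxy for the sub-Gaussian parameter (a sub-Gaussian variable with parameter $\sigma$ has variance at most $\sigma^2$).

```python
import math
import random
import statistics


def column_norms(d1, d2, nu, trials, seed=0):
    """Draw `trials` independent columns W_k in R^{d1} with i.i.d.
    N(0, nu^2 / (d1 * d2)) entries and return their l2 norms."""
    rng = random.Random(seed)
    sigma = nu / math.sqrt(d1 * d2)
    return [
        math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d1)))
        for _ in range(trials)
    ]


# Illustrative sizes only (assumptions, not values from the paper).
d1, d2, nu = 50, 40, 1.0
norms = column_norms(d1, d2, nu, trials=2000)

mean_norm = statistics.mean(norms)      # Monte Carlo estimate of E||W_1||_2
var_norm = statistics.pvariance(norms)  # Monte Carlo estimate of Var(V_k)

print(f"mean column norm = {mean_norm:.4f}   nu/sqrt(d2)  = {nu / math.sqrt(d2):.4f}")
print(f"variance of V_k  = {var_norm:.6f}   nu^2/(d1*d2) = {nu ** 2 / (d1 * d2):.6f}")
```

Because the estimates are Monte Carlo averages, they should be read as consistent with the stated bounds up to sampling error, not as a verification of them.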



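Case 2: Alternatively, we may assume that $\|\Delta^\Gamma\|_{2,1} > \sqrt{s}\, |||\Delta^\Gamma|||_F$. In this case, we need to
verify that the choice (75) of $\mu_d$ satisfies $\mu_d \|\Delta^\Gamma\|_{2,1} \ge 2 |\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle|$ with high probability. We
have the upper bound

$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \;\le\; \sup_{\substack{\|\Delta\|_{2,1} \le \|\Delta^\Gamma\|_{2,1} \\ |||\Delta|||_F \le |||\Delta^\Gamma|||_F}} |\langle\!\langle W, \Delta \rangle\!\rangle| \;=\; |||\Delta^\Gamma|||_F \; Z\Big(\frac{\|\Delta^\Gamma\|_{2,1}}{|||\Delta^\Gamma|||_F}\Big),$$

where for any radius $t > 0$, we define the random variable

$$Z(t) := \sup_{\substack{\|\Delta\|_{2,1} \le t \\ |||\Delta|||_F \le 1}} |\langle\!\langle W, \Delta \rangle\!\rangle|.$$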


Following through the same argument as in Case 2 of Appendix C.1 yields that for any fixed
$t > 0$, we have



^ ' - Vdld^ V J Vd^ V did2 



with high probability. As before, this can be extended to a uniform bound over t by a peeling 
argument, and we conclude that 



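$$|\langle\!\langle W, \Delta^\Gamma \rangle\!\rangle| \;\le\; |||\Delta^\Gamma|||_F \; Z\Big(\frac{\|\Delta^\Gamma\|_{2,1}}{|||\Delta^\Gamma|||_F}\Big) \;\le\; c\, \|\Delta^\Gamma\|_{2,1} \Big\{ [\,\ldots\,]\, \sqrt{2 + \log\Big(\frac{2 d_2}{\|\Delta^\Gamma\|_{2,1}^2 / |||\Delta^\Gamma|||_F^2}\Big)}\; [\,\ldots\,] \Big\}$$

with high probability. Since $\frac{1}{\|\Delta^\Gamma\|_{2,1}^2 / |||\Delta^\Gamma|||_F^2} \le \frac{1}{s}$ by assumption, the claim follows.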
D Proof of Lemma 4

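Since $\widehat{\Theta}$ and $\Theta^*$ are optimal and feasible (respectively) for the convex program (41), we have

$$\frac{1}{2} |||Y - \widehat{\Theta} - \widehat{\Gamma}|||_F^2 + \lambda_d |||\widehat{\Theta}|||_N \;\le\; \frac{1}{2} |||Y - \Theta^* - \widehat{\Gamma}|||_F^2 + \lambda_d |||\Theta^*|||_N.$$

Recalling that $Y = \Theta^* + \Gamma^* + W$ and re-writing in terms of the error matrices $\Delta^\Gamma = \widehat{\Gamma} - \Gamma^*$
and $\Delta^\Theta = \widehat{\Theta} - \Theta^*$, we find that

$$\frac{1}{2} |||\Delta^\Theta + \Delta^\Gamma - W|||_F^2 + \lambda_d |||\Theta^* + \Delta^\Theta|||_N \;\le\; \frac{1}{2} |||\Delta^\Gamma - W|||_F^2 + \lambda_d |||\Theta^*|||_N.$$

Expanding the Frobenius norm and reorganizing terms yields

$$\frac{1}{2} |||\Delta^\Theta|||_F^2 \;\le\; |\langle\!\langle \Delta^\Theta, \Delta^\Gamma - W \rangle\!\rangle| + \lambda_d \big\{ |||\Theta^*|||_N - |||\Theta^* + \Delta^\Theta|||_N \big\}.$$

From Lemma 1 in the paper [21], there exists a decomposition $\Delta^\Theta = \Delta^\Theta_A + \Delta^\Theta_B$ such that the
rank of $\Delta^\Theta_A$ is upper bounded by $2r$ and

$$|||\Theta^*|||_N - |||\Theta^* + \Delta^\Theta_A + \Delta^\Theta_B|||_N \;\le\; 2 \sum_{j=r+1}^{d} \sigma_j(\Theta^*) + |||\Delta^\Theta_A|||_N - |||\Delta^\Theta_B|||_N,$$

which implies that

$$\begin{aligned}
\frac{1}{2} |||\Delta^\Theta|||_F^2 \;&\le\; |\langle\!\langle \Delta^\Theta, \Delta^\Gamma - W \rangle\!\rangle| + \lambda_d \big\{ |||\Delta^\Theta_A|||_N - |||\Delta^\Theta_B|||_N \big\} + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \\
&\overset{(i)}{\le}\; |\langle\!\langle \Delta^\Theta, \Delta^\Gamma \rangle\!\rangle| + |\langle\!\langle \Delta^\Theta, W \rangle\!\rangle| + \lambda_d |||\Delta^\Theta_A|||_N - \lambda_d |||\Delta^\Theta_B|||_N + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \\
&\overset{(ii)}{\le}\; |||\Delta^\Theta|||_F\, \delta + |||\Delta^\Theta|||_N\, |||W|||_{op} + \lambda_d |||\Delta^\Theta_A|||_N - \lambda_d |||\Delta^\Theta_B|||_N + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \\
&\overset{(iii)}{\le}\; |||\Delta^\Theta|||_F\, \delta + |||W|||_{op} \big\{ |||\Delta^\Theta_A|||_N + |||\Delta^\Theta_B|||_N \big\} + \lambda_d |||\Delta^\Theta_A|||_N - \lambda_d |||\Delta^\Theta_B|||_N + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \\
&=\; |||\Delta^\Theta|||_F\, \delta + |||\Delta^\Theta_A|||_N \big\{ |||W|||_{op} + \lambda_d \big\} + |||\Delta^\Theta_B|||_N \big\{ |||W|||_{op} - \lambda_d \big\} + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*),
\end{aligned}$$

where step (i) follows by the triangle inequality; step (ii) by the Cauchy-Schwarz and Hölder
inequalities, together with our assumed bound $|||\Delta^\Gamma|||_F \le \delta$; and step (iii) by substituting $\Delta^\Theta = \Delta^\Theta_A + \Delta^\Theta_B$
and applying the triangle inequality.

Since we have chosen $\lambda_d \ge |||W|||_{op}$, we conclude that

$$\begin{aligned}
\frac{1}{2} |||\Delta^\Theta|||_F^2 \;&\le\; |||\Delta^\Theta|||_F\, \delta + 2\lambda_d |||\Delta^\Theta_A|||_N + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*) \\
&\le\; |||\Delta^\Theta|||_F\, \delta + 2\lambda_d \sqrt{2r}\, |||\Delta^\Theta|||_F + 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*),
\end{aligned}$$

where the second inequality follows since $|||\Delta^\Theta_A|||_N \le \sqrt{2r}\, |||\Delta^\Theta_A|||_F \le \sqrt{2r}\, |||\Delta^\Theta|||_F$. We have thus
obtained a quadratic inequality in $|||\Delta^\Theta|||_F$, and applying the quadratic formula yields the
claim.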
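The final step can be made explicit as follows. This is a sketch using shorthand introduced here for illustration (it does not appear in the original): $x = |||\Delta^\Theta|||_F$, $b = \delta + 2\sqrt{2r}\, \lambda_d$ and $c = 2\lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*)$, assuming the quadratic inequality reads $\frac{1}{2} x^2 \le b x + c$. The constants obtained are one valid choice and need not coincide with those stated in Lemma 4.

$$\frac{1}{2} x^2 \le b x + c \;\Longrightarrow\; x \le b + \sqrt{b^2 + 2c} \;\Longrightarrow\; x^2 \le 2 b^2 + 2 (b^2 + 2c) = 4 b^2 + 4 c,$$

using $(u + v)^2 \le 2u^2 + 2v^2$ in the last implication. Applying the same inequality to $b^2 \le 2\delta^2 + 16\, r \lambda_d^2$ gives

$$|||\Delta^\Theta|||_F^2 \;\le\; 8 \delta^2 + 64\, r \lambda_d^2 + 8 \lambda_d \sum_{j=r+1}^{d} \sigma_j(\Theta^*).$$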